Performance modeling of graph processing computing architectures

ABSTRACT

A distributed simulation system is provided that includes a timing simulator and one or more functional simulators on different computing nodes to simulate a graph processing system. Each functional simulator is to simulate execution of a set of instructions on the graph processing system and to send information associated with the simulated set of instructions to the timing simulator over a network. The timing simulator is to determine timing information associated with execution of the sets of instructions sent by the functional simulators and send the timing information to the functional simulators over the network. The timing simulator may determine a global synchronization point for the functional simulators and send the timing information for the sets of instructions to the respective functional simulators at the global synchronization point. Each functional simulator may stall simulation of further instructions until the timing information for its set of instructions is received from the timing simulator.

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application Ser. No. 63/295,280, filed Dec. 30, 2021, which is incorporated by reference herein in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT

This invention was made with Government support under Agreement No. HR0011-17-3-0004, awarded by Defense Advanced Research Projects Agency (DARPA). The Government has certain rights in the invention.

BACKGROUND

A datacenter may include one or more platforms each including at least one processor and associated memory modules. Each platform of the datacenter may facilitate the performance of any suitable number of processes associated with various applications running on the platform. These processes may be performed by the processors and other associated logic of the platforms. Each platform may additionally include I/O controllers, such as network adapter devices, which may be used to send and receive data on a network for use by the various applications.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a block diagram of components of an example datacenter.

FIG. 2A is a simplified block diagram illustrating an example graph processing core.

FIG. 2B is a simplified block diagram illustrating an example graph processing device.

FIG. 3A is a simplified block diagram illustrating a simplified example of a graph structure.

FIG. 3B is a simplified block diagram illustrating a representation of an example access stream using an example graph structure.

FIG. 4 is a simplified block diagram illustrating example components of an example graph processing core.

FIG. 5 is a diagram illustrating example operations of an example graph processing core offload engine.

FIG. 6 is a simplified block diagram illustrating an example implementation of a graph processing system including both graph processing cores and dense compute cores.

FIG. 7 is a simplified block diagram illustrating an example system including graph processing capabilities.

FIG. 8 is a simplified block diagram illustrating an example dense compute core.

FIG. 9 is a simplified block diagram illustrating an example dense offload queue associated with a dense compute core.

FIG. 10 is a simplified flow diagram illustrating example flows involved in the offloading of functions from a graph processing core.

FIG. 11 illustrates an example system simulator with separate timing and functional simulators in accordance with one or more embodiments.

FIG. 12 illustrates an example instruction communication flow between functional simulators and a timing simulator in accordance with one or more embodiments.

FIG. 13 illustrates an example instruction communication flow between a multi-threaded functional simulator and a timing simulator in accordance with one or more embodiments.

FIG. 14 illustrates a flow diagram of an example simulation process implemented by a functional simulator in accordance with one or more embodiments.

FIG. 15 illustrates a flow diagram of an example simulation process implemented by a timing simulator in accordance with one or more embodiments.

FIG. 16 is a block diagram of a system.

FIG. 17 is a block diagram of a more specific example system.

FIG. 18 is a block diagram of a second more specific example system.

FIG. 19 is a block diagram of a system on a chip (SoC).

FIG. 20 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics.

FIG. 21 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

FIG. 1 illustrates a block diagram of components of a datacenter 100 in accordance with certain embodiments. In the embodiment depicted, datacenter 100 includes a plurality of platforms 102 (e.g., 102A, 102B, 102C, etc.), data analytics engine 104, and datacenter management platform 106 coupled together through network 108. A platform 102 may include platform logic 110 with one or more central processing units (CPUs) 112 (e.g., 112A, 112B, 112C, 112D), memories 114 (which may include any number of different modules), chipsets 116 (e.g., 116A, 116B), communication interfaces 118, and any other suitable hardware and/or software to execute a hypervisor 120 or other operating system capable of executing processes associated with applications running on platform 102. In some embodiments, a platform 102 may function as a host platform for one or more guest systems 122 that invoke these applications.

Each platform 102 may include platform logic 110. Platform logic 110 includes, among other logic enabling the functionality of platform 102, one or more CPUs 112, memory 114, one or more chipsets 116, and communication interface 118. Although three platforms are illustrated, datacenter 100 may include any suitable number of platforms. In various embodiments, a platform 102 may reside on a circuit board that is installed in a chassis, rack, composable server, disaggregated server, or other suitable structure that includes multiple platforms coupled together through network 108 (which may include, e.g., a rack or backplane switch).

CPUs 112 may each include any suitable number of processor cores. The cores may be coupled to each other, to memory 114, to at least one chipset 116, and/or to communication interface 118, through one or more controllers residing on CPU 112 and/or chipset 116. In particular embodiments, a CPU 112 is embodied within a socket that is permanently or removably coupled to platform 102. CPU 112 is described in further detail below in connection with FIG. 4. Although four CPUs are shown, a platform 102 may include any suitable number of CPUs.

Memory 114 may include any form of volatile or non-volatile memory including, without limitation, magnetic media (e.g., one or more tape drives), optical media, random access memory (RAM), read-only memory (ROM), flash memory, removable media, or any other suitable local or remote memory component or components. Memory 114 may be used for short, medium, and/or long-term storage by platform 102. Memory 114 may store any suitable data or information utilized by platform logic 110, including software embedded in a computer readable medium, and/or encoded logic incorporated in hardware or otherwise stored (e.g., firmware). Memory 114 may store data that is used by cores of CPUs 112. In some embodiments, memory 114 may also include storage for instructions that may be executed by the cores of CPUs 112 or other processing elements (e.g., logic resident on chipsets 116) to provide functionality associated with components of platform logic 110. Additionally or alternatively, chipsets 116 may each include memory that may have any of the characteristics described herein with respect to memory 114. Memory 114 may also store the results and/or intermediate results of the various calculations and determinations performed by CPUs 112 or processing elements on chipsets 116. In various embodiments, memory 114 may include one or more modules of system memory coupled to the CPUs through memory controllers (which may be external to or integrated with CPUs 112). In various embodiments, one or more particular modules of memory 114 may be dedicated to a particular CPU 112 or other processing device or may be shared across multiple CPUs 112 or other processing devices.

A platform 102 may also include one or more chipsets 116 including any suitable logic to support the operation of the CPUs 112. In some cases, chipsets 116 may be implementations of graph processing devices, such as discussed herein. In various embodiments, chipset 116 may reside on the same package as a CPU 112 or on one or more different packages. Each chipset may support any suitable number of CPUs 112. A chipset 116 may also include one or more controllers to couple other components of platform logic 110 (e.g., communication interface 118 or memory 114) to one or more CPUs. Additionally or alternatively, the CPUs 112 may include integrated controllers. For example, communication interface 118 could be coupled directly to CPUs 112 via one or more integrated I/O controllers resident on each CPU.

Chipsets 116 may each include one or more communication interfaces 128 (e.g., 128A, 128B). Communication interface 128 may be used for the communication of signaling and/or data between chipset 116 and one or more I/O devices, one or more networks 108, and/or one or more devices coupled to network 108 (e.g., datacenter management platform 106 or data analytics engine 104). For example, communication interface 128 may be used to send and receive network traffic such as data packets. In a particular embodiment, communication interface 128 may be implemented through one or more I/O controllers, such as one or more physical network interface controllers (NICs), also known as network interface cards or network adapters. An I/O controller may include electronic circuitry to communicate using any suitable physical layer and data link layer standard such as Ethernet (e.g., as defined by an IEEE 802.3 standard), Fibre Channel, InfiniBand, Wi-Fi, or other suitable standard. An I/O controller may include one or more physical ports that may couple to a cable (e.g., an Ethernet cable). An I/O controller may enable communication between any suitable element of chipset 116 (e.g., switch 130 (e.g., 130A, 130B)) and another device coupled to network 108. In some embodiments, network 108 may include a switch with bridging and/or routing functions that is external to the platform 102 and operable to couple various I/O controllers (e.g., NICs) distributed throughout the datacenter 100 (e.g., on different platforms) to each other. In various embodiments an I/O controller may be integrated with the chipset (i.e., may be on the same integrated circuit or circuit board as the rest of the chipset logic) or may be on a different integrated circuit or circuit board that is electromechanically coupled to the chipset. In some embodiments, communication interface 128 may also allow I/O devices integrated with or external to the platform (e.g., disk drives, other NICs, etc.) to communicate with the CPU cores.

Switch 130 may couple to various ports (e.g., provided by NICs) of communication interface 128 and may switch data between these ports and various components of chipset 116 according to one or more link or interconnect protocols, such as Peripheral Component Interconnect Express (PCIe), Compute Express Link (CXL), HyperTransport, GenZ, OpenCAPI, and others, which may each alternatively or collectively apply the general principles and/or specific features discussed herein. Switch 130 may be a physical or virtual (i.e., software) switch.

Platform logic 110 may include an additional communication interface 118. Similar to communication interface 128, this additional communication interface 118 may be used for the communication of signaling and/or data between platform logic 110 and one or more networks 108 and one or more devices coupled to the network 108. For example, communication interface 118 may be used to send and receive network traffic such as data packets. In a particular embodiment, communication interface 118 includes one or more physical I/O controllers (e.g., NICs). These NICs may enable communication between any suitable element of platform logic 110 (e.g., CPUs 112) and another device coupled to network 108 (e.g., elements of other platforms or remote nodes coupled to network 108 through one or more networks). In particular embodiments, communication interface 118 may allow devices external to the platform (e.g., disk drives, other NICs, etc.) to communicate with the CPU cores. In various embodiments, NICs of communication interface 118 may be coupled to the CPUs through I/O controllers (which may be external to or integrated with CPUs 112). Further, as discussed herein, I/O controllers may include a power manager 125 to implement power consumption management functionality at the I/O controller (e.g., by automatically implementing power savings at one or more interfaces of the communication interface 118, such as a PCIe interface coupling a NIC to another element of the system), among other example features.

Platform logic 110 may receive and perform any suitable types of processing requests. A processing request may include any request to utilize one or more resources of platform logic 110, such as one or more cores or associated logic. For example, a processing request may include a processor core interrupt; a request to instantiate a software component, such as an I/O device driver 124 or virtual machine 132 (e.g., 132A, 132B); a request to process a network packet received from a virtual machine 132 or device external to platform 102 (such as a network node coupled to network 108); a request to execute a workload (e.g., process or thread) associated with a virtual machine 132, application running on platform 102, hypervisor 120 or other operating system running on platform 102; or other suitable request.

In various embodiments, processing requests may be associated with guest systems 122. A guest system may include a single virtual machine (e.g., virtual machine 132A or 132B) or multiple virtual machines operating together (e.g., a virtual network function (VNF) 134 or a service function chain (SFC) 136). As depicted, various embodiments may include a variety of types of guest systems 122 present on the same platform 102.

A virtual machine 132 may emulate a computer system with its own dedicated hardware. A virtual machine 132 may run a guest operating system on top of the hypervisor 120. The components of platform logic 110 (e.g., CPUs 112, memory 114, chipset 116, and communication interface 118) may be virtualized such that it appears to the guest operating system that the virtual machine 132 has its own dedicated components.

A virtual machine 132 may include a virtualized NIC (vNIC), which is used by the virtual machine as its network interface. A vNIC may be assigned a media access control (MAC) address, thus allowing multiple virtual machines 132 to be individually addressable in a network.

In some embodiments, a virtual machine 132B may be paravirtualized. For example, the virtual machine 132B may include augmented drivers (e.g., drivers that provide higher performance or have higher bandwidth interfaces to underlying resources or capabilities provided by the hypervisor 120). For example, an augmented driver may have a faster interface to underlying virtual switch 138 for higher network performance as compared to default drivers.

VNF 134 may include a software implementation of a functional building block with defined interfaces and behavior that can be deployed in a virtualized infrastructure. In particular embodiments, a VNF 134 may include one or more virtual machines 132 that collectively provide specific functionalities (e.g., wide area network (WAN) optimization, virtual private network (VPN) termination, firewall operations, load-balancing operations, security functions, etc.). A VNF 134 running on platform logic 110 may provide the same functionality as traditional network components implemented through dedicated hardware. For example, a VNF 134 may include components to perform any suitable NFV workloads, such as virtualized Evolved Packet Core (vEPC) components, Mobility Management Entities, 3rd Generation Partnership Project (3GPP) control and data plane components, etc.

SFC 136 is a group of VNFs 134 organized as a chain to perform a series of operations, such as network packet processing operations. Service function chaining 136 may provide the ability to define an ordered list of network services (e.g., firewalls, load balancers) that are stitched together in the network to create a service chain.
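As a non-limiting illustration, the ordered-list behavior of a service chain can be sketched in Python as follows. The packet fields, the blocked port, the backend addresses, and the function names (firewall, load_balancer, run_chain) are hypothetical and chosen only for this example; they do not describe any particular VNF implementation.

```python
from typing import Callable, List, Optional

Packet = dict  # hypothetical packet representation used only in this sketch
# A network function returns the (possibly modified) packet, or None to drop it.
NetworkFunction = Callable[[Packet], Optional[Packet]]


def firewall(packet: Packet) -> Optional[Packet]:
    # Drop traffic addressed to a blocked port (port 23 is an arbitrary example).
    return None if packet["dst_port"] == 23 else packet


def load_balancer(packet: Packet) -> Optional[Packet]:
    # Pick a backend from an assumed pool based on the flow identifier.
    backends = ["10.0.0.1", "10.0.0.2"]
    chosen = backends[packet["flow_id"] % len(backends)]
    return {**packet, "backend": chosen}


def run_chain(chain: List[NetworkFunction], packet: Packet) -> Optional[Packet]:
    # Apply each service in order; stop as soon as one service drops the packet.
    current: Optional[Packet] = packet
    for function in chain:
        current = function(current)
        if current is None:
            return None
    return current


service_chain = [firewall, load_balancer]
forwarded = run_chain(service_chain, {"dst_port": 443, "flow_id": 7})
dropped = run_chain(service_chain, {"dst_port": 23, "flow_id": 1})
```

In this sketch the order of the list determines the order of processing, so a packet rejected by the first service never reaches the second.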

A hypervisor 120 (also known as a virtual machine monitor) may include logic to create and run guest systems 122. The hypervisor 120 may present guest operating systems run by virtual machines with a virtual operating platform (i.e., it appears to the virtual machines that they are running on separate physical nodes when they are actually consolidated onto a single hardware platform) and manage the execution of the guest operating systems by platform logic 110. Services of hypervisor 120 may be provided by virtualizing in software or through hardware assisted resources that require minimal software intervention, or both. Multiple instances of a variety of guest operating systems may be managed by the hypervisor 120. Each platform 102 may have a separate instantiation of a hypervisor 120.

Hypervisor 120 may be a native or bare-metal hypervisor that runs directly on platform logic 110 to control the platform logic and manage the guest operating systems. Alternatively, hypervisor 120 may be a hosted hypervisor that runs on a host operating system and abstracts the guest operating systems from the host operating system. Various embodiments may include one or more non-virtualized platforms 102, in which case any suitable characteristics or functions of hypervisor 120 described herein may apply to an operating system of the non-virtualized platform.

Hypervisor 120 may include a virtual switch 138 that may provide virtual switching and/or routing functions to virtual machines of guest systems 122. The virtual switch 138 may include a logical switching fabric that couples the vNICs of the virtual machines 132 to each other, thus creating a virtual network through which virtual machines may communicate with each other. Virtual switch 138 may also be coupled to one or more networks (e.g., network 108) via physical NICs of communication interface 118 so as to allow communication between virtual machines 132 and one or more network nodes external to platform 102 (e.g., a virtual machine running on a different platform 102 or a node that is coupled to platform 102 through the Internet or other network). Virtual switch 138 may include a software element that is executed using components of platform logic 110. In various embodiments, hypervisor 120 may be in communication with any suitable entity (e.g., a SDN controller) which may cause hypervisor 120 to reconfigure the parameters of virtual switch 138 in response to changing conditions in platform 102 (e.g., the addition or deletion of virtual machines 132 or identification of optimizations that may be made to enhance performance of the platform).

Hypervisor 120 may include any suitable number of I/O device drivers 124. I/O device driver 124 represents one or more software components that allow the hypervisor 120 to communicate with a physical I/O device. In various embodiments, the underlying physical I/O device may be coupled to any of CPUs 112 and may send data to CPUs 112 and receive data from CPUs 112. The underlying I/O device may utilize any suitable communication protocol, such as PCI, PCIe, Universal Serial Bus (USB), Serial Attached SCSI (SAS), Serial ATA (SATA), InfiniBand, Fibre Channel, an IEEE 802.3 protocol, an IEEE 802.11 protocol, or other current or future signaling protocol.

The underlying I/O device may include one or more ports operable to communicate with cores of the CPUs 112. In one example, the underlying I/O device is a physical NIC or physical switch. For example, in one embodiment, the underlying I/O device of I/O device driver 124 is a NIC of communication interface 118 having multiple ports (e.g., Ethernet ports).

In other embodiments, underlying I/O devices may include any suitable device capable of transferring data to and receiving data from CPUs 112, such as an audio/video (A/V) device controller (e.g., a graphics accelerator or audio controller); a data storage device controller, such as a flash memory device, magnetic storage disk, or optical storage disk controller; a wireless transceiver; a network processor; or a controller for another input device such as a monitor, printer, mouse, keyboard, or scanner; or other suitable device.

In various embodiments, when a processing request is received, the I/O device driver 124 or the underlying I/O device may send an interrupt (such as a message signaled interrupt) to any of the cores of the platform logic 110. For example, the I/O device driver 124 may send an interrupt to a core that is selected to perform an operation (e.g., on behalf of a virtual machine 132 or a process of an application). Before the interrupt is delivered to the core, incoming data (e.g., network packets) destined for the core might be cached at the underlying I/O device and/or an I/O block associated with the CPU 112 of the core. In some embodiments, the I/O device driver 124 may configure the underlying I/O device with instructions regarding where to send interrupts.

In some embodiments, as workloads are distributed among the cores, the hypervisor 120 may steer a greater number of workloads to the higher performing cores than the lower performing cores. In certain instances, cores that are exhibiting problems such as overheating or heavy loads may be given fewer tasks than other cores or avoided altogether (at least temporarily). Workloads associated with applications, services, containers, and/or virtual machines 132 can be balanced across cores using network load and traffic patterns rather than just CPU and memory utilization metrics.

The elements of platform logic 110 may be coupled together in any suitable manner. For example, a bus may couple any of the components together. A bus may include any known interconnect, such as a multi-drop bus, a mesh interconnect, a ring interconnect, a point-to-point interconnect, a serial interconnect, a parallel bus, a coherent (e.g., cache coherent) bus, a layered protocol architecture, a differential bus, or a Gunning transceiver logic (GTL) bus.

Elements of the datacenter 100 may be coupled together in any suitable manner, such as through one or more networks 108. A network 108 may be any suitable network or combination of one or more networks operating using one or more suitable networking protocols. A network may represent a series of nodes, points, and interconnected communication paths for receiving and transmitting packets of information that propagate through a communication system. For example, a network may include one or more firewalls, routers, switches, security appliances, antivirus servers, or other useful network devices. A network offers communicative interfaces between sources and/or hosts, and may include any local area network (LAN), wireless local area network (WLAN), metropolitan area network (MAN), Intranet, Extranet, Internet, wide area network (WAN), virtual private network (VPN), cellular network, or any other appropriate architecture or system that facilitates communications in a network environment. A network can include any number of hardware or software elements coupled to (and in communication with) each other through a communications medium. In various embodiments, guest systems 122 may communicate with nodes that are external to the datacenter 100 through network 108.

Current practices in data analytics and artificial intelligence perform tasks such as object classification on unending streams of data. Computing infrastructure for classification is predominantly oriented toward “dense” compute, such as matrix computations. The continuing exponential growth in generated data has shifted some compute to be offloaded to GPUs and other application-focused accelerators across multiple domains that are dense-compute dominated. However, the next step in the evolution of artificial intelligence (AI), machine learning, and data analytics is reasoning about the relationships between these classified objects. In some implementations, a graph structure (or data structure) may be defined and utilized to define relationships between classified objects. For instance, determining the relationships between entities in a graph is the basis of graph analytics. Graph analytics poses important challenges for existing processor architectures due to its sparse structure.

High-performance large scale graph analytics is essential to timely analyze relationships in big data sets. The combination of low performance and very large graph sizes has traditionally limited the usability of graph analytics. Indeed, conventional processor architectures suffer from inefficient resource usage and bad scaling on graph workloads. Recognizing both the increasing importance of graph analytics and the need for vastly improved sparse computation performance compared to traditional approaches, an improved system architecture is presented herein that is adapted to performing high-performance graph processing by addressing constraints across the network, memory, and compute architectures that typically limit performance on graph workloads.

FIG. 2A is a simplified block diagram 200 a representing the general architecture of an example graph processing core 205. While a graph processing core 205, as discussed herein, may be particularly adept, at an architectural level, at handling workloads to implement graph-based algorithms, it should be appreciated that the architecture of a graph processing core 205 may handle any program developed to utilize its architecture and instruction set, including programs entirely unrelated to graph processing. Indeed, a graph processing core (e.g., 205) may adopt an architecture configured to provide massive multithreading and enhanced memory efficiency to minimize latency to memory and hide remaining latency to memory. Indeed, the high input/output (I/O) and memory bandwidth of the architecture enable the graph processing core 205 to be deployed in a variety of applications where memory efficiency is at a premium and memory bandwidth requirements made by the application are prohibitively demanding to traditional processor architectures. Further, the architecture of the graph processing core 205 may realize this enhanced memory efficiency by granularizing its memory accesses in relatively small, fixed chunks (e.g., 8B random access memory), equipping the cores with networking capabilities optimized for corresponding small transactions, and providing extensive multi-threading.

In the example of FIG. 2A, an example graph processing core 205 may include a number of multi-threaded pipelines or cores (MTCs) (e.g., 215 a-d) and a number of single-threaded pipelines or cores (STCs) (e.g., 220 a-b). In some implementations, the MTCs and STCs may be architecturally the same, but for the ability of the MTCs to support multiple concurrent threads and to switch between these threads. For instance, each MTC and STC may have 32 registers per thread, have all state address mapped, and utilize a common instruction set architecture (ISA). In one example, the pipeline/core ISAs may be Reduced Instruction Set Computer (RISC)-based with fixed-length instructions.

In one example, each MTC (e.g., 215 a-d) may support sixteen threads with only minimal interrupt handling. For instance, each thread in an MTC may execute a portion of a respective instruction, with the MTC switching between the active threads automatically or opportunistically (e.g., switch from executing one thread to the next in response to a load operation by the first thread so as to effectively hide the latency of the load operation (allowing the other thread or threads to operate during the cycles needed for the load operation to complete), among other examples). An MTC thread may be required to finish executing its respective instruction before taking on another. In some implementations, the MTCs may adopt a barrel model, among other features or designs. STCs may execute a single thread at a time and may support full interrupt handling. Portions of a workload handled by a graph processing core 205 may be divided not only between the MTCs (e.g., with sixteen threads per MTC), but also between the MTCs 215 a-d and STCs 220 a-b. For instance, STCs 220 a-b may be optimized for various types of operations (e.g., load-store forwarding, branch predictions, etc.) and programs may make use of STCs for some operations and the multithreading capabilities of the MTCs for other instructions.
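As a non-limiting illustration of how switching between threads may hide load latency, the following Python sketch compares a pipeline that stalls on every load with a pipeline that issues from the next ready thread in round-robin order. The load latency of 8 cycles, the single-cycle cost of other instructions, the four-thread workload, and the function names are assumptions made for this sketch and are not taken from the described architecture; the model is far simpler than an actual MTC.

```python
LOAD_LATENCY = 8  # assumed load latency in cycles (illustrative value only)


def run_blocking(threads):
    """Run threads one after another; the pipeline stalls for every load."""
    cycles = 0
    for ops in threads:
        for op in ops:
            cycles += LOAD_LATENCY if op == "load" else 1
    return cycles


def run_switching(threads):
    """Issue at most one instruction per cycle from the next ready thread.

    A thread that issues a load is not ready again until the load completes,
    so other threads can use the intervening cycles.
    """
    count = len(threads)
    pc = [0] * count        # next instruction index for each thread
    ready_at = [0] * count  # first cycle at which each thread may issue again
    cycle = 0
    last = -1               # thread that issued most recently
    finish = 0              # cycle at which the last instruction completes
    while any(pc[t] < len(threads[t]) for t in range(count)):
        for offset in range(1, count + 1):
            t = (last + offset) % count
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                op = threads[t][pc[t]]
                pc[t] += 1
                ready_at[t] = cycle + (LOAD_LATENCY if op == "load" else 1)
                finish = max(finish, ready_at[t])
                last = t
                break
        cycle += 1
    return finish


# Four hypothetical threads, each performing one load followed by one ALU op.
workload = [["load", "alu"] for _ in range(4)]
blocking_cycles = run_blocking(workload)
switching_cycles = run_switching(workload)
```

Under these assumptions, the stalling pipeline needs 36 cycles while the switching pipeline needs 12, because the four loads overlap instead of being serialized.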

An example graph processing core 205 may include additional circuitry to implement components such as a scratchpad 245, uncore, and memory controller (e.g., 250). Components of the graph processing core 205 may be interconnected via a crossbar interconnect (e.g., a full crossbar 255) that ties all components in the graph processing core 205 together in a low latency, high bandwidth network. The memory controller 250 may be implemented as a narrow channel memory controller, for instance, supporting a narrow, fixed 8-byte memory channel. Data pulled using the memory controller from memory in the system may be loaded into a scratchpad memory region 245 for use by other components of the graph processing core 205. In one example, the scratchpad may provide 2 MB of scratchpad memory per core (e.g., MTC and STC) and provide dual network ports (e.g., via 1 MB regions).
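As a non-limiting illustration of why a narrow, fixed 8-byte access granularity may suit scattered accesses, the following Python sketch counts the bytes moved when each distinct aligned block touched by an access pattern is fetched once. The 64-byte comparison size is an assumption standing in for a wider transfer unit such as a conventional cache line; it does not appear in the description above, and the address patterns and function name are likewise hypothetical.

```python
NARROW_BYTES = 8   # access granularity named in the description
WIDE_BYTES = 64    # assumed wider transfer unit, used only for comparison


def traffic_bytes(addresses, transfer_bytes):
    """Bytes moved if each distinct aligned block of transfer_bytes is fetched once."""
    blocks = {address // transfer_bytes for address in addresses}
    return len(blocks) * transfer_bytes


# 1000 scattered 8-byte words, one per 4 KiB region (sparse, graph-like pattern).
scattered = [i * 4096 for i in range(1000)]
# 1000 consecutive 8-byte words (dense, streaming pattern).
sequential = [i * 8 for i in range(1000)]

scattered_narrow = traffic_bytes(scattered, NARROW_BYTES)
scattered_wide = traffic_bytes(scattered, WIDE_BYTES)
sequential_narrow = traffic_bytes(sequential, NARROW_BYTES)
sequential_wide = traffic_bytes(sequential, WIDE_BYTES)
```

In this simplified model the dense pattern moves the same 8,000 bytes at either granularity, while the scattered pattern moves 8,000 bytes at 8-byte granularity and 64,000 bytes at the assumed 64-byte granularity, seven eighths of which is data that was never requested.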

In some implementations, an uncore region of a graph processing core 205 may be equipped with enhanced functionality to allow the MTCs 215 a-d and STCs 220 a-b to handle exclusively substantive, algorithmic workloads, with supporting work handled by the enhanced uncore, including synchronization, communication, and data movement/migration. The uncore may perform a variety of tasks including copy and merge operations, reductions, gathers/scatters, packs/unpacks, in-flight matrix transposes, advanced atomics, hardware collectives, reductions in parallel prefixes, hardware queuing engines, and so on. The ISA of the uncore can come from the pipelines' (MTCs and STCs) synchronous execution. In one example, the uncore may include components such as a collective engine 260, a queue engine 265, an atomic engine 270, and memory engine 275, among other example components and corresponding logic. An example memory engine 275 may provide an internal DMA engine for the architecture. The queue engine 265 can orchestrate and queue messaging within the architecture, with messages optimized in terms of (reduced) size to enable very fast messaging within the architecture. An example collective engine 260 may handle various collective operations for the architecture, including reductions, barriers, scatters, gathers, broadcasts, etc. The atomic engine 270 may handle any memory controller lock scenarios impacting the memory controller 250, among other example functionality.

FIG. 2B is a simplified block diagram illustrating an example system 200 b with a set of graph processing cores 205 a-d. A graph processing node may include a respective graph processing core (e.g., 205 a-d) and a corresponding memory (e.g., dynamic random access memory (DRAM) (e.g., 225)). Each node may include a respective graph processing core (e.g., 205), which includes a set of MTCs (e.g., 215) as well as a set of single-thread cores (STCs) (e.g., 220), such as in the example graph processing core 205 illustrated and described above in the example of FIG. 2A. In one example, multiple graph processing nodes may be incorporated in or mounted on the same package or board and interconnected via a high-radix (e.g., multiple (e.g., >3) ports per connection), low-diameter (e.g., of 3 or less) network. The example system 200 may further include interconnect ports (e.g., 230, 235) to enable the system 200 to be coupled to other computing elements including other types of processing units (e.g., central processing units (CPUs), graphics processing units (GPUs), tensor processing units (TPUs), etc.). In some cases, a graph processing chip, chiplet, board, or device (e.g., system 200) may be coupled to other graph processing devices (e.g., additional instances of the same type of graph processing system (e.g., 200)). In some implementations, interconnects 230, 235 may be utilized to couple to other memory devices, allowing this external memory and local DRAM (e.g., 225) to function as shared system memory of the graph processing nodes for use by graph processing cores and other logic of the graph processing nodes, among other examples.

FIG. 3A is a simplified representation of an example graph structure 300. The graph structure may be composed of multiple interconnected nodes (e.g., 305, 310, 315, 320, 325, 330, 335). An edge is defined by the interface between one graph node and a respective neighboring graph node. Each node may be connected to one or more other nodes in the graph. The sparseness of graph data structures leads to scattered and irregular memory accesses and communication, challenging the decades-long optimizations made in traditional dense compute solutions. As an example, consider the common case of pushing data along the graph edges (e.g., with reference to the simplified graph 300 example of FIG. 3A). All vertices initially store a value locally and then proceed to add their value to all neighbors along outgoing edges. This basic computation is ubiquitous in graph algorithms. FIG. 3B illustrates a representation 350 of an example access stream (e.g., from node 1 (305)), which illustrates the irregularity and lack of locality in such operations, making conventional prefetching and caching effectively useless.
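As an illustrative, non-limiting sketch of the push computation described above, the following Python models a small graph as an adjacency list and records the order in which destination values are touched. The graph contents and the names `push_along_edges` and `out_edges` are hypothetical and are not taken from FIG. 3A or FIG. 3B; the recorded index sequence is only a stand-in for a memory access stream.

```python
def push_along_edges(out_edges, values):
    """Each vertex adds its locally stored value to every out-neighbor.

    Returns the updated values and the sequence of destination indices
    touched, used here as a stand-in for the memory access stream.
    """
    updated = list(values)
    access_stream = []
    for src, neighbors in enumerate(out_edges):
        for dst in neighbors:
            # Read-modify-write on a location chosen by the graph structure,
            # not by any regular stride.
            updated[dst] += values[src]
            access_stream.append(dst)
    return updated, access_stream


# Hypothetical 7-vertex graph (illustrative only).
out_edges = [[3, 6], [0, 5], [4], [1, 2, 6], [0], [3], [2, 5]]
values = [1, 2, 3, 4, 5, 6, 7]
updated, stream = push_along_edges(out_edges, values)
print(updated)
print(stream)
```

Even in this small example, consecutive entries of the access stream jump between unrelated indices, which is the property that defeats stride-based prefetching.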

More generally, graph algorithms face several major scalability challenges on traditional CPU and GPU architectures, because of the irregularity and sparsity of graph structures. For instance, in traditional cache-based processor architectures, which utilize prefetching, the execution of graph applications may suffer from inefficient cache and bandwidth utilization. Due to the sparsity of graph structures, caches used in such applications are thrashed with single-use sparse accesses and useless prefetches where most (e.g., 64 byte) memory fetches contain only a small amount (e.g., 8 bytes out of 64) of useful data. Further, overprovisioning memory bandwidth and/or cache space to cope with sparsity is inefficient in terms of power consumption, chip area, and I/O pin count.
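The bandwidth cost implied by the 8-of-64-byte example above can be illustrated with simple arithmetic. The following sketch assumes, purely for illustration, that every sparse access touches a distinct line and that nothing is reused; the function names and the access count are hypothetical.

```python
def useful_fraction(useful_bytes, fetch_bytes):
    """Fraction of each memory fetch that carries data the program needs."""
    return useful_bytes / fetch_bytes


def traffic_bytes(num_accesses, fetch_bytes):
    """Total bytes moved, assuming one fetch per access and no reuse."""
    return num_accesses * fetch_bytes


num_accesses = 1_000_000  # hypothetical number of sparse 8-byte reads
wide = traffic_bytes(num_accesses, 64)    # 64-byte cache-line fetches
narrow = traffic_bytes(num_accesses, 8)   # 8-byte fetches
print(useful_fraction(8, 64))  # 0.125
print(wide // narrow)          # 8
```

Under these assumptions, only one eighth of the fetched data is useful, and the wide-line configuration moves eight times as many bytes for the same work.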

Further analysis of graph algorithms shows additional problems in optimizing performance. For instance, in the execution of graph algorithms, the computations may be irregular in character—they exhibit skewed compute time distributions, encounter frequent control flow instructions, and perform many memory accesses. For instance, for an example graph-based link analysis algorithm for a search engine, the compute time for a vertex in the algorithm is proportional to the number of outgoing edges (degree) of that vertex. Graphs such as the one illustrated in FIG. 3A may have skewed degree distributions, and thus the work per vertex has a high variance, leading to significant load imbalance. Graph applications may be heavy on branches and memory operations. Furthermore, conditional branches are often data dependent, e.g., checking the degree or certain properties of vertices, leading to irregular and therefore hard to predict branch outcomes. Together with the high cache miss rates caused by the sparse accesses, conventional performance oriented out-of-order processors are largely underutilized: most of the time they are stalled on cache misses, while a large part of the speculative resources is wasted due to branch mispredictions.

As additional example shortcomings of conventional computing architectures' ability to handle graph processing, graph algorithms require frequent fine- and coarse-grained synchronization. For example, fine-grained synchronizations (e.g., atomics) may be required in a graph algorithm to prevent race conditions when pushing values along edges. Synchronization instructions that resolve in the cache hierarchy place a large stress on the cache coherency mechanisms for multi-socket systems, and all synchronizations incur long round-trip latencies on multi-node systems. Additionally, the sparse memory accesses result in even more memory traffic for synchronizations due to false sharing in the cache coherency system. Coarse-grained synchronizations (e.g., system-wide barriers and prefix scans) fence the already-challenging computations in graph algorithms. These synchronizations have diverse uses including resource coordination, dynamic load balancing, and the aggregation of partial results. These synchronizations can dominate execution time on large-scale systems due to high network latencies and imbalanced computation.

Additionally, current commercial graph databases may be quite large (e.g., exceed 20 TB as an in-memory representation). Such large problems may exceed the capabilities of even a rack of computational nodes of any type, which requires a large-scale multi-node platform to even house the graph's working set. When combined with the prior observations (poor memory hierarchy utilization, high control flow changes, frequent memory references, and abundant synchronizations), reducing the latency to access remote data is a challenge, combined with latency hiding techniques in the processing elements, among other example considerations. The limitations of traditional architectures in effectively handling graph algorithms extend beyond CPUs to include traditional GPUs: sparse accesses prevent memory coalescing, branches cause thread divergence, and synchronization limits thread progress. While GPUs may have more threads and much higher memory bandwidth, GPUs have limited memory capacity and limited scale-out capabilities, which means that they are unable to process large, multi-TB graphs. Furthermore, where graphs are extremely sparse (<<1% non-zeros), typical GPU memory usage is orders of magnitude less efficient, making GPUs all but unusable outside of the smallest graphs, among other example issues.

An improved computing system architecture may be implemented in computing systems to enable more efficient (e.g., per watt performance) graph analytics. In one example, specialized graph processing cores may be networked together in a low diameter, high radix manner to more efficiently handle graph analytics workloads. The design of such graph processing cores builds on the observations that most graph workloads have abundant parallelism, are memory bound, and are not compute intensive. These observations call for many simple pipelines, with multi-threading to hide memory latency. Returning to the discussion of FIG. 2, such graph processing cores may be implemented as multi-threaded cores (MTCs), which are round-robin multi-threaded in-order pipelines. In one implementation, at any moment, each thread in an MTC can only have one in-flight instruction, which considerably simplifies the core design for better energy efficiency. Single-threaded cores (STCs) are used for single-thread performance sensitive tasks, such as memory and thread management threads (e.g., from the operating system). These are in-order stall-on-use cores that are able to exploit some instruction and memory-level parallelism, while avoiding the high power consumption of aggressive out-of-order pipelines. In some implementations, both MTCs and STCs may implement the same custom RISC instruction set.

Turning to FIG. 4, a simplified block diagram 400 is shown illustrating example components of an example graph processing core device (e.g., 205). A graph processing core device may include a set of multi-threaded cores (MTCs) (e.g., 215). In some instances, both multi-threaded cores and single threaded cores (STCs) may be provided within a graph processing block. Further, each core may have a small data cache (D$) (e.g., 410) and an instruction cache (I$) (e.g., 415), and a register file (RF) (e.g., 420) to support its thread count. Because of the low locality in graph workloads, no higher cache levels need be included, avoiding useless chip area and power consumption of large caches. For scalability, in some implementations, caches are not coherent. In such implementations, programs that are to be executed using the system may be adapted to avoid modifying shared data that is cached, or to flush caches if required for correctness. As noted above, in some implementations, MTCs and STCs are grouped into blocks, each of which may be provided with a large local scratchpad (SPAD) memory 245 for low latency storage. Programs run on such platforms may select which memory accesses to cache (e.g., local stack), which to put on SPAD (e.g., often reused data structures or the result of a direct memory access (DMA) gather operation), and which not to store locally. Further, prefetchers may be omitted from such architectures to avoid useless data fetches and to limit power consumption. Instead, some implementations may utilize offload engines or other circuitry to efficiently fetch large chunks of useful data.

Continuing with this example, although the MTCs of an example graph processing core hide some of the memory latency by supporting multiple concurrent threads, an MTC may adopt an in-order design, which limits the number of outstanding memory accesses to one per thread. To increase memory-level parallelism and to free more compute cycles to the graph processing core, a memory offload engine (e.g., 430) may be provided for each block. The offload engine performs memory operations typically found in many graph applications in the background, while the cores continue with their computations. Turning to FIG. 5, a simplified block diagram 500 is shown illustrating example operations of an example graph processing core offload engine (e.g., 430) including atomics 505 and gather operations 510, among other examples. Further, a direct memory access (DMA) engine may perform operations such as (strided) copy, scatter, and gather. Queue engines may also be provided, which are responsible for maintaining queues allocated in shared memory, alleviating the core from atomic inserts and removals, among other example benefits. The logic of an offload engine can be used for work stealing algorithms and dynamically partitioning the workload. Further, the offload engines can implement efficient system-wide reductions and barriers. Remote atomics perform atomic operations at the memory controller where the data is located, instead of burdening the pipeline with first locking the data, moving the data to the core, updating it, writing back, and unlocking. They enable efficient and scalable synchronization, which is indispensable for the high thread count in this improved graph-optimized system architecture. The collective logic (or engines) of the offload engines may be directed by the graph processing cores using specific instructions defined in an instruction set.
These instructions may be non-blocking, enabling the graph processing cores to perform other work while these memory management operations are performed in the background. Custom polling and waiting instructions may also be included within the instruction set architecture (ISA) for use in synchronizing the threads and offloaded computations, among other example features. In some implementations, example graph processing cores and chipsets may not rely on any locality. Instead, the graph processing cores may collectively use their offload engines to perform complex systemwide memory operations in parallel, and only move the data that is eventually needed to the core that requests it. For example, a DMA gather will not move the memory stored indices or addresses of the data elements to gather to the requesting core, but only the requested elements from the data array.
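The data-movement benefit of the DMA gather described above can be sketched by counting the bytes that reach the requesting core under two approaches. This is a simplified software model only: the function names, the 8-byte index and element sizes, and the sample data are assumptions made for illustration, and the model ignores request and control traffic.

```python
def core_side_gather(data, indices, elem_bytes=8, index_bytes=8):
    """Core loads each index, then loads the element it points to."""
    gathered = [data[i] for i in indices]
    # Both the index list and the elements travel to the core.
    bytes_to_core = len(indices) * (index_bytes + elem_bytes)
    return gathered, bytes_to_core


def offloaded_gather(data, indices, elem_bytes=8):
    """Gather is resolved away from the core; only elements are returned."""
    gathered = [data[i] for i in indices]
    # The index list stays where it is stored; only results travel.
    bytes_to_core = len(indices) * elem_bytes
    return gathered, bytes_to_core


data = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]  # hypothetical array
indices = [6, 1, 4]                                      # hypothetical index list
core_result, core_bytes = core_side_gather(data, indices)
dma_result, dma_bytes = offloaded_gather(data, indices)
print(core_result == dma_result, core_bytes, dma_bytes)
```

Both approaches return the same elements; under the stated size assumptions the offloaded version moves half as many bytes to the core, because the indices never leave memory.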

Returning to FIG. 4, an example graph processing device may additionally include a memory controller 250 to access and manage requests of local DRAM. Further, sparse and irregular accesses to a large data structure are typical for graph analysis applications. Therefore, accesses to remote memory should be done with minimal overhead. An improved system architecture, such as introduced above, utilizing specialized graph processing cores adapted for processing graph-centric workloads may, in some implementations, implement a hardware distributed global address space (DGAS), which enables respective cores (e.g., graph processing core or support dense core) to uniformly access memory across the full system, which may include multiple nodes (e.g., multiple graph processing cores, corresponding memory, and memory management hardware) with one address space. Accordingly, a network interface (e.g., 440) may be provided to facilitate network connections between processing cores (e.g., on the same or different die, package, board, rack, etc.).

Besides avoiding the overhead of setting up communication for remote accesses, a DGAS also greatly simplifies programming, because there is no implementation difference between accessing local and remote memory. Further, in some implementations, address translation tables (ATT) may be provided, which contain programmable rules to translate application memory addresses to physical locations, to arrange the address space to the need of the application (e.g., address interleaved, block partitioned, etc.). Memory controllers may be provided within the system (e.g., one per block) to natively support relatively small cache lines (e.g., 8 byte accesses, rather than 64 byte accesses), while supporting standard cache line accesses as well. Such components may enable only the data that is actually needed to be fetched, thereby reducing memory bandwidth pressure and utilizing the available bandwidth more efficiently.
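As a hedged illustration of the two example address arrangements mentioned above (address interleaved and block partitioned), the following sketch maps an application address to a (node, local offset) pair. The function names, the 8-byte interleaving granule, and the rule encodings are hypothetical; the actual format of an address translation table is not specified here.

```python
def interleaved(addr, num_nodes, granule=8):
    """Spread consecutive granules across nodes in round-robin order."""
    granule_idx, byte_in_granule = divmod(addr, granule)
    node = granule_idx % num_nodes
    local = (granule_idx // num_nodes) * granule + byte_in_granule
    return node, local


def block_partitioned(addr, bytes_per_node):
    """Give each node one contiguous block of the address space."""
    node, local = divmod(addr, bytes_per_node)
    return node, local


# Consecutive 8-byte words land on different nodes when interleaved...
print([interleaved(a, num_nodes=4) for a in (0, 8, 16, 24, 32)])
# ...and on the same node when block partitioned.
print([block_partitioned(a, bytes_per_node=1024) for a in (0, 8, 16, 24, 32)])
```

Interleaving tends to spread a hot data structure across many memory controllers, whereas block partitioning keeps neighboring addresses on one node; which is preferable depends on the application's access pattern.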

As noted above, a system, implemented as a chiplet, board, rack, or other platform, may include multiple interconnected graph processing cores, among other hardware elements. FIG. 6 is a simplified block diagram 600 showing an example implementation of a graph processing system 602 including a number of graph processing cores (e.g., 205 a-h) each coupled to a high-radix, low-diameter network to interconnect all of the graph processing cores in the system. In this example implementation, the system may further include dense compute cores (e.g., 605 a-h) likewise interconnected. In some instances, kernel functions, which would more efficiently be executed using dense compute logic, may be offloaded from the graph processing cores to one or more of the dense compute cores. The graph processing cores may include associated memory blocks, which may be exposed to programmers via their own memory maps. Memory controllers (MC) (e.g., 610) may be provided in the system to interface with other memory, including memory external to the system (e.g., on a different die, board, or rack). High speed input/output (HSIO) circuitry (e.g., 615) may also be provided on the system to enable core blocks and devices to couple to other computing devices, such as compute, accelerator, networking, and/or memory devices external to the system, among other examples.

A network may be provided in a system to interconnect the components within the system (e.g., on the same SoC or chiplet die, etc.) and the attributes of the network may be specially configured to support and enhance the graph processing efficiencies of the system. Indeed, the network connecting the blocks is responsible for sending memory requests to remote memory controllers. Similar to the memory controller, it is optimized for small messages (e.g., 8 byte messages). Furthermore, due to the high fraction of remote accesses, network bandwidth may exceed local DRAM bandwidth, which is different from conventional architectures that assume higher local traffic than remote traffic. To obtain high bandwidth and low latency to remote blocks, the network needs a high radix and a low diameter. Various topologies may be utilized to implement such network dimensions and characteristics. In one example, a HyperX topology may be utilized, with all-to-all connections on each level. In some implementations, links on the highest levels are implemented as optical links to ensure power-efficient, high-bandwidth communication. The hierarchical topology and optical links enable the system to efficiently scale out to many nodes, maintaining easy and fast remote access.
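The relationship between radix and diameter in a topology with all-to-all connections on each level can be sketched as follows. The sketch assumes, for illustration only, that each level behaves as one dimension of a regular HyperX-style network, so that a switch is identified by one coordinate per level and is directly linked to every switch that differs from it in exactly one coordinate. The shape (4, 4, 4) and all function names are hypothetical.

```python
from itertools import product


def hops(a, b):
    """Minimal hop count: one hop fixes one differing coordinate."""
    return sum(1 for x, y in zip(a, b) if x != y)


def diameter(shape):
    """Largest minimal hop count over all switch pairs (brute force)."""
    switches = list(product(*(range(s) for s in shape)))
    return max(hops(a, b) for a in switches for b in switches)


def network_ports_per_switch(shape):
    """All-to-all per level: (size - 1) links in each level."""
    return sum(s - 1 for s in shape)


shape = (4, 4, 4)  # hypothetical: three levels of four switches each
print(diameter(shape))                  # 3
print(network_ports_per_switch(shape))  # 9
```

Under this assumption, 64 switches are reachable within three hops using nine network ports per switch, which illustrates how spending radix buys a low diameter.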

FIG. 7 is a simplified block diagram showing the use of an example graph processing system (incorporating graph processing cores, such as discussed above) in a server system. A graph processing device (e.g., 705) may be provided with a set of graph processing cores (and in some cases, supplemental dense compute cores). A graph processing device 705 may enable specialized processing support to handle graph workloads with small and irregular memory accesses through near-memory atomics, among other features, such as discussed above. Multiple such graph processing devices (e.g., 705, 715, 720, 725, etc.) may be provided on a board, rack, blade, or other platform (e.g., 710). In some implementations, the platform system 710 may include not only an interconnected network of graph processing devices (and their constituent graph processing cores), but the system 710 may further include general purpose processors (e.g., 730), SoC devices, accelerators, memory elements (e.g., 735), as well as additional switches, fabrics, or other circuitry (e.g., 740) to interconnect and facilitate the communication of data between devices (e.g., 705-740) on the platform. The system 710 may adopt a global memory model and be interconnected consistent with the networking and packaging principles described herein to enable high I/O and memory bandwidth.

In some implementations, the system 710 may itself be capable of being further connected to other systems, such as other blade systems in a server rack system (e.g., 750). Multiple systems within the server system 750 may also be equipped with graph processing cores to further scale the graph processing power of a system. Indeed, multiple servers full of such graph processing cores may be connected via a wider area network (e.g., 760) to further scale such systems. The networking of such devices using the proposed graph processing architecture offers networking as a first-class citizen, supports point-to-point messaging, and relies upon a flattened latency hierarchy, among other example features and advantages.

In one example system, a C/C++ compiler (e.g., based on LLVM) may be utilized in the development of software for use with the graph processing systems described herein. For instance, the compiler may support a Reduced Instruction Set Computer (RISC) instruction set architecture (ISA) of the graph processing system, including basic library functions. In some implementations, graph-processing-specific operations, such as the offload engines and remote atomics, are accessible using intrinsics. Additionally, the runtime environment of the system may implement basic memory and thread management, supporting common programming models, such as gather-apply-scatter, task-based, and single program, multiple data (SPMD)-style parallelism. Among other tools, an architectural simulator for the graph processing architecture may be provided to simulate the timing of all instructions in the pipelines, engines, memory, and network, based on the hardware specifications. Additional software development tools may be provided to assist developers in developing software for such graph processing systems, such as tools to simulate execution of the software, generate performance estimations of running a workload on the system, and generate performance analysis reports (e.g., CPI stacks and detailed performance information on each memory structure and each instruction), among other example features. Such tools may enable workload owners to quickly detect bottleneck causes, and to use these insights to optimize the workload for graph processing systems.

In some implementations, software developed to perform graph analytics using the improved graph processing architecture discussed herein may be implemented as basic kernels, and library overhead may be limited. In networked systems of multiple graph processing cores, the application code does not need to change for multinode execution, thanks to the system-wide shared memory. As an example, a software application may be written to cause the graph processing system to perform a sparse matrix dense vector multiplication (SpMV) algorithm. The basic operation of SpMV may include a multiply-accumulate of sparse matrix elements and a dense vector. A matrix input may be provided (e.g., an RMAT-30 synthetic matrix) stored in compressed sparse row (CSR) format. In one example, a straightforward implementation of SpMV may be programmed, with each thread of the graph processing cores calculating one or more elements of the result vector. The rows are partitioned across the threads based on the number of non-zeros for a balanced execution. This implementation does not make use of DMA operations, and all accesses are non-cached at a default length (e.g., 8-byte), with thread local stack accesses cached by default. Such an implementation may outperform high performance CPU architectures (e.g., Intel Xeon™) through the use of a higher thread count and 8-byte memory accesses, avoiding memory bandwidth saturation. In other implementations, an SpMV algorithm may be programmed to execute on the graph processing architecture utilizing selective caching. For instance, accesses to the matrix values are cached, while the sparse accesses to the vector bypass caches. In the compressed sparse row (CSR) representation of a sparse matrix, all non-zero elements on a row are stored consecutively and accessed sequentially, resulting in spatial locality.
The dense vector, on the other hand, is accessed sparsely, because only a few of its elements are needed for the multiply-accumulate (the indices of the non-zeros in the row of the matrix). Accordingly, the accesses to the matrix are cached, while the vector accesses remain uncached 8-byte accesses, leading to a further potential performance improvement relative to CPU architectures. Further, an implementation of the SpMV algorithm may be further enhanced using a graph processing architecture, for instance, by a DMA gather operation to fetch the elements of the dense vector that are needed for the current row from memory. These elements may then be stored on local scratchpad. The multiply-accumulate reduction is then done by the core, fetching the matrix elements from cache and the vector elements from scratchpad. Not only does this significantly reduce the number of load instructions, it also reduces data movement: the index list does not need to be transferred to the requesting core, only the final gathered vector elements. While data is gathered, the thread is stalled, allowing other threads that have already fetched their data to compute a result vector element.
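The gather-based SpMV variant described above can be sketched in software as follows. This is a minimal functional model, not the described implementation: the per-row list comprehension merely stands in for a DMA gather into scratchpad, the matrix is a hypothetical 3-by-4 example, and the names `spmv_csr`, `row_ptr`, `col_idx`, and `vals` are illustrative.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        # Stand-in for a DMA gather: collect only the vector elements
        # needed for this row (sparse accesses into x).
        gathered = [x[col_idx[k]] for k in range(start, end)]
        # Multiply-accumulate: matrix values are read sequentially,
        # gathered vector elements are read from the local buffer.
        acc = 0.0
        for offset, xv in enumerate(gathered):
            acc += vals[start + offset] * xv
        y.append(acc)
    return y


# Hypothetical matrix:
# [[1, 0, 2, 0],
#  [0, 0, 0, 3],
#  [4, 5, 0, 0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 3, 0, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]
y = spmv_csr(row_ptr, col_idx, vals, x)
print(y)  # [7.0, 12.0, 14.0]
```

Note that `vals` and `col_idx` are walked in order within each row, while `x` is indexed irregularly; this is the distinction that motivates caching the former and gathering the latter.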

Programs, such as the examples above, may be designed to effectively use the graph processing architecture (e.g., using more than 95% of the available memory bandwidth, while not wasting bandwidth on useless and sparse accesses) and realize potentially exponential efficiency improvement over traditional architectures. Further, the improved graph processing architecture provides much higher thread count support (e.g., 144 threads for Xeon, versus thousands of threads (e.g., 16,000+) in the graph processing core implementation), enabling threads to progress while others are stalled on memory operations, efficient small size local and remote memory operations, and powerful offload engines that allow for more memory/compute overlap. Scaling graph processing systems (e.g., with multiple nodes) may yield compounding benefits (even if not perfectly linear, for instance, due to larger latencies and bandwidth restrictions or other example issues) to significantly outperform other conventional multinode processor configurations. While the examples focus on an SpMV algorithm, it should be appreciated that this example was offered as but one of many example graph algorithms. Similar programs may be developed to leverage the features of a graph processing architecture to more efficiently perform other graph-based algorithms including application classification, random walks, graph search, Louvain community, TIES sampler, Graph2Vec, Graph Sage, Graph Wave, parallel decoding FST, geolocation, breadth-first search, sparse matrix-sparse vector multiplication (SpMSpV), among other examples.

As noted above, sparse workloads exhibit a large number of random remote memory accesses and have been shown to be heavily network and memory bandwidth-intensive and less dependent on compute capability. While the graph processing architecture discussed herein provides efficient support for workloads that are truly sparse (and may be alternatively referred to as “sparse compute” devices), such a graph processing architecture lacks sufficient compute performance to execute dense kernels (e.g., matrix multiply, convolution, etc.) at needed performance in some applications. Dense kernels are a critical component of many critical compute applications such as image processing. Even with matrix computation units included, a challenge remains of effective integration of dense compute and offloading operations with regards to memory movement, matrix operation definition, and controllability across multiple threads.

Traditional offloading techniques (e.g., for offloading to an on-chip accelerator in an SoC) include memory mapped registers. For instance, the pipeline/core can perform the offload of the computation by writing to memory mapped registers present inside the accelerator. These registers may specify configurations as well as data needed to be used for the computation. This may also require the pipeline to monitor/poll registers if it is not sure that the offload engine is idle. In one example of a graph processing system, an enhanced offload mechanism may be used to offload dense compute work from the graph processing cores to dense compute cores. A hardware managed queue stores incoming offload instructions, monitors the current status of the pipeline, and launches the instructions sequentially, enabling an easy offload mechanism for the software. Multiple graph processing core threads can each use the dense compute bandwidth of the dense compute cores by calling a new ISA function (e.g., by calling the dense.func) without worrying about the status of the dense core and whether other cores are using the dense core at the same time. The offload instruction can also allow efficient and simple passing of the program counter and operand addresses to one of the dense compute cores as well. The queue gives metrics through software readable registers (e.g., the number of instructions waiting (in a COUNT value)) and can help in tracking average waiting requests and other statistics for any dense core.
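The behavior of such a hardware managed queue can be illustrated with a small software model. The class and method names below are hypothetical, and the model captures only the properties stated above: requests carrying a program counter and operand addresses are accepted at any time, are launched one at a time in arrival order when the pipeline is idle, and the number of waiting requests is readable as a count.

```python
from collections import deque


class DenseOffloadQueueModel:
    """Toy model of a FIFO that launches offload requests sequentially."""

    def __init__(self):
        self._pending = deque()
        self.busy = False     # whether the modeled pipeline is running a kernel
        self.launched = []    # launch order, for inspection

    @property
    def count(self):
        """Number of requests waiting (models a software readable COUNT)."""
        return len(self._pending)

    def enqueue(self, pc, operand_addrs):
        """Accept a request without the caller checking pipeline status."""
        self._pending.append((pc, tuple(operand_addrs)))
        self._try_launch()

    def complete(self):
        """Signal that the running kernel finished; launch the next one."""
        self.busy = False
        self._try_launch()

    def _try_launch(self):
        if not self.busy and self._pending:
            self.launched.append(self._pending.popleft())
            self.busy = True


q = DenseOffloadQueueModel()
q.enqueue(0x100, [0x1000, 0x2000])  # hypothetical PC and operand addresses
q.enqueue(0x200, [0x3000])
q.enqueue(0x300, [])
print(q.count)  # 2: the first request launched immediately, two are waiting
q.complete()
print(q.count)  # 1
```

In this model the callers never poll for idleness; ordering and launch timing are handled entirely inside the queue, mirroring the division of responsibility described above.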

As noted above, a graph processing architecture may be particularly suited to operate on sparse workloads exhibiting a large number of random remote memory accesses and that are heavily network and memory bandwidth-intensive and less dependent on compute capability. To efficiently address this workload space, a graph processing architecture has a highly scalable low-diameter and high-radix network and many optimized memory interfaces on each die in the system. While this architectural approach provides efficient support for workloads that are truly sparse, providing a system with graph processing cores alone lacks sufficient compute performance to execute dense kernels (e.g., matrix multiply, convolution, etc.) that may be utilized in some applications. To correct this performance gap, some systems incorporating a graph processing architecture may further include dense compute cores in addition to the graph processing cores, such as illustrated in the example of FIG. 6. In this example, eight dense compute cores (e.g., 605 a-h) are incorporated into each die of a graph processing device (e.g., 602) to be incorporated in a system. In such implementations, kernel functions are offloaded from threads in the graph processing cores (e.g., 205 a-h) to any dense core 605 a-h in the system 602 via directed messages.

In one example implementation, the compute capability within each dense core is implemented with a 16×16 reconfigurable spatial array of compute elements or systolic array (also referred to herein as a “dense array (DA)”). In some implementations, the reconfigurable array of compute elements of a dense compute core may be implemented as a multi-dimensional systolic array. This array is capable of a variety of floating point and integer operations of varying precisions. In this example, at a 2 GHz operating frequency, a single dense core can achieve, in total, a peak performance of 1 TFLOP of double precision FMAs. Respective dense cores may have a control pipeline responsible for configuring the DA, executing DMA operations to efficiently move data into local buffers, and moving data into and out of the DA to execute the dense computation. The specific characteristics (e.g., memory locations, compute types, and data input sizes) of the operations vary based on the corresponding kernel. These kernels are programmed by software and launched on the control pipeline at a desired program counter (PC) value.
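The peak figure above is consistent with a simple back-of-the-envelope calculation. The sketch below assumes, as a labeled assumption that is not stated in the description, that each compute element completes one fused multiply-add per cycle and that an FMA is counted as two floating point operations; the function name is illustrative.

```python
def peak_flops(rows, cols, freq_hz, flops_per_element_per_cycle=2):
    """Peak throughput of an array of compute elements.

    flops_per_element_per_cycle=2 assumes one FMA per element per cycle,
    counted as a multiply plus an add (an assumption for illustration).
    """
    return rows * cols * freq_hz * flops_per_element_per_cycle


peak = peak_flops(16, 16, 2e9)
print(peak / 1e12)  # 1.024, i.e., roughly 1 TFLOP
```

Under this assumption, 256 elements at 2 GHz yield 1.024 × 10^12 operations per second, matching the approximately 1 TFLOP stated for a single dense core.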

In some implementations, graph processing cores within a system that also include dense compute cores may include a dense offload queue and corresponding hardware circuitry to perform offloads from the graph processing core to the dense compute core control. This offload pipeline is managed by hardware through the dense offload queues (DOQ), thereby simplifying programmability for the software offloading the dense compute. With full hardware management, there is no need for software to check for the idleness of the dense compute or to manage the contents and ordering of the queue, among other example benefits. The hardware circuitry managing the DOQs may also handle passing of the required program counter (PC) information, the operand, and the result matrix addresses to the control pipeline in a simple manner, among other example features.

In some implementations, a specialized instruction in the graph processing architecture ISA may be provided as a handle for initiating a request to a dense compute core. For instance, the software may use a dense function ISA instruction (e.g., ‘dense.func’) to trigger the offloading of a task from a graph processing core to a dense compute core by sending an instruction packet over the network interconnecting the cores from the graph processing core to one of the dense compute cores. The request may include the address of the target dense compute core, which may be used by the network to route the packet to the appropriate dense compute core. The request packet may be received at the dense offload queue (DOQ) corresponding to the targeted dense compute core.

Turning to FIG. 8, a simplified block diagram is shown illustrating an example dense compute core 605. Dense compute cores (e.g., 605) may include an array 812 of interconnected compute units, which provide the dense computing functionality of the dense compute core. In some examples, a 16×16 array of compute elements may be provided. A dense compute core 605, in one example implementation, may also include a dense offload queue 804 and control pipeline 810 and crossbar (XBAR) 808 to support the movement of data between the dense compute core and other components of the system (e.g., graph processing cores, memory controllers and associated blocks of shared memory, other dense compute cores, etc.). Logic for executing a dense offload instruction may be implemented as a decoder circuit and/or an execution circuit (e.g., execution unit) in the dense offload queue, the control pipeline, or other components of the dense compute core. Various instructions may be received for a dense computing core at its dense offload queue (e.g., 804).

In some implementations, control pipeline 810 may be implemented as a single-threaded pipeline for managing and orchestrating hardware of the dense compute core 605 for performing various functions. For instance, control pipeline 810 may configure the reconfigurable array of compute elements 812 in one of a variety of possible configurations, read data from local or remote memory (e.g., through DMA calls to shared memory), copy/write such data to local scratchpad memory 816 of the dense compute core for use by the array 812, load instructions corresponding to a set of functions, instructions, kernel, or other program (e.g., based on a program counter value) for execution by compute units in the array, move result data (e.g., data generated during execution of the dense workload offloaded to the dense core) from the dense compute core (e.g., from scratchpad 816) to memory accessible to a graph processing core (e.g., through a remote atomic), and update registers identifying progress of the workload execution by the array of compute circuits, among other example tasks and functions.

Dense offload queue 804 may be utilized to provide hardware-managed orchestration and monitoring of workloads offloaded to the corresponding dense compute core 605 (e.g., from a sparse-compute graph processing core). The dense offload queue 804 may maintain a hardware-based queue of received instructions, may identify when the control pipeline 810 (and compute array 812) are available to handle a next instruction in the queue, and monitor the status of the control pipeline and performance of functions associated with an offload request. In this manner, the dense offload queue 804 may simplify software development for platforms incorporating a mix of sparse graph processing cores and dense processing cores by implementing the orchestration and monitoring of offloaded dense compute tasks in hardware. For instance, a single instruction (e.g., a dense offload instruction (e.g., dense.func)) may be defined in the ISA of the platform to simply and elegantly allow hardware to manage offloading of tasks and the performance of these tasks by a corresponding dense compute core (e.g., 605). The dense offload queue 804 can cause or launch action by the control pipeline 810 including the performance of actions using crossbar 808, DMA engine 820, and/or micro-DMA engine 814 to appropriately configure the dense compute core hardware to perform a set of particular tasks, kernel, or other program. In certain embodiments, memory interface 822 is coupled to a (e.g., system) memory, e.g., shared memory external from the dense compute core 605. In certain embodiments, other components (e.g., core(s)) are coupled to core 605 via network switch 802, such as other dense compute cores and graph processing cores, among other example elements.

In certain embodiments, a micro-DMA engine 814 is coupled to the array of compute circuits 812, a scratch pad memory 816 (e.g., memory address accessible), and/or a buffer 818 (e.g., not memory address accessible) that bypasses the SPAD. In one embodiment, local scratchpad (SPAD) 816 is used to hold data that is subject to high reuse, and bypass SPAD buffer 818 is used for low-reuse data to reduce offload latency. Thirty-two parallel input/output ports are used as an example, and it should be understood that other numbers of ports may be utilized, e.g., 64, 128, etc. In certain embodiments, micro-DMA engine 814 is not coupled to memory external to core 605 and/or is not part of a cache coherency hierarchy.

In some implementations, the array of compute circuits 812 of a dense compute core is implemented as a multi-element (e.g., 16 element×16 element) reconfigurable spatial array of compute circuits (e.g., a dense array (DA)) capable of a variety of floating point and integer operations of varying precisions (e.g., a grid of floating-point unit (FPU) and/or arithmetic-logic unit (ALU) blocks). The reconfigurability of the array of compute circuits 812 allows for multiple options for connectivity between its internal compute circuits. In certain embodiments, the connectivity is pre-configured in the array of compute circuits 812 before (e.g., kernel) execution begins. Embodiments herein utilize a reconfigurable array of compute circuits because (i) given optimal array configuration, it provides high compute efficiency for a subset of kernels under a variety of input and output matrix sizes, and (ii) the programmability of the DA (e.g., via the μDMA instructions) seamlessly integrates into an ISA (e.g., an ISA for the second core type) with minimal control pipeline modifications, among other example features and benefits.

To achieve the optimal combination of ease of programmability and high compute performance through an array of compute circuits, embodiments herein utilize a DMA engine (e.g., micro-DMA engine) to provide the following features: (i) flexibility in the input/output matrix characteristics (e.g., configurability of row and/or column dimensions as well as the organization of the data structure in memory (e.g., row major or column major)), (ii) supporting the method of data movement and memory access patterns for multiple modes of the array (e.g., multicast, unicast, or systolic mode), and (iii) providing high parallelism at each array input/output to hit the highest performance.

The dense offload queue 804 manages incoming dense function requests passed from graph processing cores in a system. For instance, when a dense function request is received, the DOQ 804 will store it in its local memory buffer (DOQ SRAM 806). Whenever the control pipeline 810 has completed execution of the previous kernel and becomes free (or immediately if it is already free), the DOQ 804 pops the function from its queue and launches the corresponding thread on the control pipeline 810. Accordingly, in some implementations, the DOQ 804 is responsible for both queue pointer management in the dense compute core, as well as serializing and launching the dense functions in the order that they were received and monitoring the status of the control pipeline 810 to determine when they need to be popped off the queue. Further, the DOQ 804 can load the matrix addresses passed along with a dense functional instruction call (e.g., dense.func) into the register file of the control pipeline 810 and thus enables the control pipeline ease of access to this data, among other example functions.

Table 1 shows the structure of one example implementation of a dense function (dense.func) instruction:

TABLE 1 Example dense.func instruction description

| Instruction | ASM Form Arguments | Argument Descriptions |
|---|---|---|
| dense.func | r1, r2, r3, r4, r5 | r1 = start PC; r2 = Matrix A address; r3 = Matrix B address; r4 = Matrix C address; r5 = Target dense core address |

In the example instruction illustrated in Table 1:

-   r1—Starting PC of the kernel. This value is passed to the control pipeline 810 when execution begins.
-   r2—Base address for input matrix A. The control pipeline 810 will use this when copying data from off-die memory to the dense core's local scratchpad (e.g., 816).
-   r3—Base address for input matrix B. The control pipeline 810 will use this when copying data from off-die memory to the dense core's local scratchpad 816.
-   r4—Base address for output matrix C. The control pipeline 810 will use this when copying data from the dense core's local scratchpad 816 to off-die memory or other memory outside the dense compute core.
-   r5—Address of the dense core that is to execute the dense function.

FIG. 9 is a simplified block diagram 900 illustrating the example microarchitecture of an example dense offload queue 804, as may be utilized in some implementations. An input arbiter block 940 may be provided that receives instructions passed to a corresponding dense compute core on a network coupling the dense compute core to other dense compute cores as well as graph processing cores on the platform. The input arbiter block 940 may pass the relevant instructions (that utilize the DOQ) to the DOQ. The DOQ 804 may include an input interface unit 905, which includes input flops and an input first-in-first-out (FIFO) queue (e.g., to absorb received requests in instances of stalling or backpressure). In one example, the input FIFO may be four entries deep, although other implementations may utilize queues of different depths. The input interface unit 905 may check for the correct opcode and send a negative acknowledgment (NACK) if the opcode does not match dense.func. The dense offload queue 804 may further include, in some implementations, a response FIFO queue 915 and machine specific register (MSR) block 920, which includes two write ports: one from the MSR block and another from the input interface for ACK/NACK responses. The response FIFO queue 915 may queue and send ACK/NACK responses to the requesting graph processing core to communicate to the requesting graph processing core that its instruction has/has not been accepted. In the event the instruction is negatively acknowledged, the graph processing core may attempt a different instruction or attempt to offload a workload through the instruction to another dense compute core, among other examples.

Continuing with the example of FIG. 9, the dense offload queue 804 may additionally include a dense instruction FIFO 925, which serves as the hardware queue for queuing offload instructions received for the dense compute core. In one implementation, if the dense instruction FIFO 925 is not full and the input FIFO is not empty, then the input interface unit 905 sends write requests to the dense instruction FIFO 925. If the dense instruction FIFO 925 is full, then the input interface unit 905 refrains from sending any more write requests to the dense instruction FIFO 925 until the queue is at least partially cleared. The input interface unit 905 may receive a response from the dense instruction FIFO 925 once the data has been stored, and forwards it to response FIFO 915 as an instruction response (e.g., for forwarding to the requesting graph processing core). An input “FIFO full” signal may be defined as a backpressure signal to the input arbiter block 940 to not accept any more instructions. For instance, when the input FIFO is full, it may create a NACK-retry response packet to be sent through the response FIFO 915, in some implementations.
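The flow control described above can be pictured with a small behavioral model. The following is a minimal sketch only, not a description of the actual circuit: the class name, the packet fields, the response labels, and the choice to check the opcode before checking for a full input FIFO are all assumptions made for illustration. The four-entry input FIFO depth comes from the text, and the 64-entry dense instruction FIFO depth is one example depth given elsewhere in this description.

```python
from collections import deque

INPUT_FIFO_DEPTH = 4    # example input FIFO depth from the text
DENSE_FIFO_DEPTH = 64   # example dense instruction FIFO depth from the text


class InputInterface:
    """Hypothetical behavioral model of the input interface unit."""

    def __init__(self):
        self.input_fifo = deque()   # absorbs requests under backpressure
        self.dense_fifo = deque()   # hardware queue of offload instructions
        self.responses = deque()    # models the response FIFO

    def receive(self, packet):
        """Handle one packet from the arbiter and report what happened."""
        if packet["opcode"] != "dense.func":
            self.responses.append(("NACK", packet["src"]))
            return "NACK"
        if len(self.input_fifo) >= INPUT_FIFO_DEPTH:
            # Input FIFO full: ask the requester to retry later.
            self.responses.append(("NACK_RETRY", packet["src"]))
            return "NACK_RETRY"
        self.input_fifo.append(packet)
        return "PENDING"

    def step(self):
        """Move at most one entry from the input FIFO to the dense FIFO."""
        if self.input_fifo and len(self.dense_fifo) < DENSE_FIFO_DEPTH:
            packet = self.input_fifo.popleft()
            self.dense_fifo.append(packet)
            self.responses.append(("ACK", packet["src"]))
            return True
        return False  # nothing to move, or the dense FIFO is full


iface = InputInterface()
for i in range(DENSE_FIFO_DEPTH):           # fill the dense instruction FIFO
    iface.receive({"opcode": "dense.func", "src": i})
    iface.step()
parked = [iface.receive({"opcode": "dense.func", "src": 100 + i})
          for i in range(INPUT_FIFO_DEPTH)]  # these wait in the input FIFO
overflow = iface.receive({"opcode": "dense.func", "src": 200})
bad = iface.receive({"opcode": "load", "src": 201})
stalled = iface.step()
```

In this model, once both queues are full a further dense.func request receives a retry response, while a request with any other opcode is rejected outright.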

In some examples, the dense instruction FIFO 925 may be implemented as a static random access memory (SRAM)-based FIFO that is 64 deep and 256 bits wide. This dense instruction FIFO 925 receives the instruction from the input interface unit 905. It sends the data to a thread launch block 950 when the queue is not empty and the control pipeline 810 is in a thread IDLE state. Due to its depth (in this example), the dense instruction FIFO 925 can hold a maximum of 64 instructions in flight. Accordingly, the DOQ 804 may be equipped with functionality to allow the DOQ 804 to check the status of the control pipeline 810 to know if a thread is currently active. For instance, this can be tracked and observed by a “Thread 0 FULL” bit in a thread control register 960. In some implementations, to optimize performance, the DOQ 804 can monitor thread activity by monitoring this thread control register value directly, for instance, as a single wire from the MSR block 920. When it is determined that the control pipeline 810 is free, the thread launch block 950 can pull data from the dense FIFO queue 925 and cause a corresponding program to be launched by the control pipeline 810 by writing initial values to the register file 965 used by the control pipeline 810 and sending additional data through a pipeline arbiter block 970 for use by the control pipeline 810 in accessing instructions corresponding to the functions identified in the offloaded workload, configuring the dense compute core array, loading the appropriate operand data in local scratchpad of the dense compute core array, and directing the movement of data and execution of instructions in accordance with performance of the functions by the dense compute array.

In some implementations, machine specific registers (MSRs) (e.g., in MSR block 920) may include one or more registers to identify (e.g., to software and/or other compute blocks of a platform) the status of the queue maintained by the DOQ 804, as well as the status of functions launched by the DOQ 804. In one example, the MSRs may include one or more status registers and one or more count registers. Given that the DOQ 804 manages a single queue of a set size (e.g., 64 entries), there may be no need for configuration of address and size, simplifying the registers in MSR block 920. In one example, the MSR status register may be utilized to record and identify the status of the single queue buffer that is managed by the dense offload queue 804. Bit fields in this register may include “full”, “empty”, and exception information. A Count register may be utilized to identify the number of elements currently in the queue. If this value is non-zero, the thread launch block will know to send the information to the control pipeline to launch the thread, among other examples.

Once the DOQ 804 detects that the control pipeline 810 is idle and the queue has a valid instruction waiting, the DOQ 804 uses the instruction information to launch the thread on the control pipeline. Accordingly, the DOQ 804 may begin by pulling the oldest instruction off of the tail of the dense FIFO queue and update the internal FIFO/queue pointers accordingly. The DOQ 804 may then issue a store operation of the Matrix A Address into the control pipeline register file (e.g., an 8 Byte store into R1), as well as issue a store operation of the Matrix B Address into the control pipeline register file (e.g., an 8 Byte store into R2) and a store operation of the Matrix C Address into the control pipeline register file (e.g., an 8 Byte store into R3). The DOQ 804 may additionally issue a store operation of the Start Program Counter (PC) value to the control pipeline thread 0 program counter MSR (e.g., an 8B store) to identify the first instruction to be executed in a set of instructions or functions embodying the workload to be performed by the dense compute core. The DOQ 804 may then issue a store operation to set the FULL and ENABLE MSR bits in the control pipeline MSR space to kick start the thread. Once these stores are all successful, the thread has been launched and the DOQ's role has concluded for that specific instruction. The DOQ 804 should now return to monitoring the idle status of the control pipeline for launching the next dense.func (if an instruction is waiting on the queue) on the control pipeline. Once an offload instruction is received and queued (or launched) by the DOQ, the DOQ may send an acknowledgement response, such as discussed above.
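The launch sequence described above may be sketched in software as follows. This is a simplified model under stated assumptions: the class names, the MSR names (`T0_PC`, `T0_FULL`, `T0_ENABLE`), and the `finish_kernel` helper are hypothetical and stand in for hardware state that the text describes only in general terms.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DenseFunc:
    """Operands carried by one queued dense.func request (see Table 1)."""
    start_pc: int
    matrix_a: int
    matrix_b: int
    matrix_c: int


@dataclass
class ControlPipeline:
    """Hypothetical stand-in for the control pipeline's visible state."""
    regs: dict = field(default_factory=dict)
    msrs: dict = field(
        default_factory=lambda: {"T0_PC": 0, "T0_FULL": 0, "T0_ENABLE": 0})

    def is_idle(self):
        return self.msrs["T0_FULL"] == 0

    def finish_kernel(self):
        # Models the pipeline going idle when the kernel completes.
        self.msrs["T0_FULL"] = 0
        self.msrs["T0_ENABLE"] = 0


def try_launch(fifo, pipe):
    """Launch the oldest queued instruction if the pipeline is idle."""
    if not fifo or not pipe.is_idle():
        return None
    instr = fifo.popleft()                 # oldest instruction first
    pipe.regs["R1"] = instr.matrix_a       # store Matrix A address
    pipe.regs["R2"] = instr.matrix_b       # store Matrix B address
    pipe.regs["R3"] = instr.matrix_c       # store Matrix C address
    pipe.msrs["T0_PC"] = instr.start_pc    # store the start PC
    pipe.msrs["T0_FULL"] = 1               # set FULL and ENABLE last
    pipe.msrs["T0_ENABLE"] = 1
    return instr


fifo = deque([DenseFunc(0x100, 0xA000, 0xB000, 0xC000),
              DenseFunc(0x200, 0xA100, 0xB100, 0xC100)])
pipe = ControlPipeline()
first = try_launch(fifo, pipe)    # launches immediately
blocked = try_launch(fifo, pipe)  # pipeline busy, nothing launched
pipe.finish_kernel()
second = try_launch(fifo, pipe)   # next instruction launches in order
```

Setting the FULL and ENABLE bits last mirrors the ordering in the text, where the thread starts only after its operands and start PC are in place.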

A dense offload queue (DOQ) may implement a hardware-based state machine and queue to enable offloading of tasks from graph processing cores to dense compute cores simply and efficiently and with minimal status monitoring by software. The DOQ may manage the queue of offload instructions received at a given dense compute core and launch corresponding threads at the hardware level, allowing software to simply utilize a dense offload function call (e.g., dense.func) and information provided through DOQ MSRs, rather than more granular management of dense compute core usage. FIG. 10 is a simplified flow diagram 1000 illustrating example flows involved in the offloading of functions from a graph processing core (e.g., 205) to a particular dense compute core 605 communicatively coupled within a network (e.g., a high-radix, low diameter network implemented on or between dies).

A graph processing core 205 may identify 1005 one or more functions in an algorithm (being executed by the graph processing core) that involve dense compute functions. Given the sparse compute architecture of the graph processing core 205, such functions may be offloaded to a dense compute core, with the software model simplified through the provision of a specialized ISA instruction (e.g., dense.func), which the software may call to trigger the use of the hardware-implemented queue (e.g., the DOQ). In response to identifying the dense workload, the graph processing core 205 may send 1010 a dense offload instruction (e.g., 1015) over the network to a particular dense processing core (e.g., 605). As multiple dense compute cores may be accessible to the graph processing core over the network, the graph processing core (e.g., utilizing one or more of its single-threaded pipelines) may determine (e.g., from a register or table maintained in software) which dense compute core(s) are available and have bandwidth to assist in handling this dense compute workload. Accordingly, the graph processing core may identify the particular dense compute core as a candidate and address the instruction 1015 accordingly.

A dense offload queue (DOQ) 804 may queue 1020 offload instructions (and potentially other instructions) received on the network. As discussed above, the DOQ 804 may monitor the status of the dense core control pipeline to determine the availability of dense compute resources. In some implementations, upon queuing a received instruction (e.g., 1020), the DOQ 804 may send 1025 a response (e.g., 1095) back to the requesting graph processing core 205 to acknowledge (e.g., 1055) the instruction. In other implementations, an additional or alternative response may be sent based on the launch of functions corresponding to the instructions, among other example implementations. The queue maintained by the DOQ may be FIFO in that, as the dense compute core finishes performing a workload corresponding to the earliest received offload instruction, the DOQ may cause the next received offload instruction to be launched (e.g., 1035) upon determining 1030 that the control pipeline is available (e.g., finished with the preceding instruction's workload). Accordingly, the DOQ 804 may message 1036 the control pipeline 810 to reactivate the control pipeline and provide the control pipeline with the information included in the next received instruction (e.g., 1015). The control pipeline 810 may use this information and launch performance of the corresponding functions by configuring 1040 the dense compute array, moving data 1045 to the local scratchpad of the dense compute core (e.g., from shared memory via DMA calls), and identifying the set of functions corresponding to the workload and orchestrating the performance of these functions using the dense compute core's resources (e.g., compute array, scratchpad, local memory, etc.).

After launching 1035 the functions, the DOQ 804 may additionally monitor 1065 the status of the control pipeline 810 to determine whether the performance of the corresponding functions is complete. This may be identified through a register associated with the control pipeline. In some implementations, the pipeline may deactivate following completion of the functions to indicate its availability and status of the functions, among other example implementations. The control pipeline 810 may also orchestrate the delivery of results or outputs (e.g., 1075) generated from performance of the function(s) by causing 1070 result data to be written to shared memory or other memory accessible to the requesting graph processing core. The graph processing core may access 1080 the result data and utilize these results (in some implementations) in the performance of subsequent sparse compute functions associated with a graph analytic algorithm performed using the graph processing core(s). In some implementations, software may cause a program or algorithm to be parallelized by splitting the workload across multiple graph processing cores and/or dense processing cores. For instance, in one example, a result generated by a dense compute core may be accessed and utilized by another graph processing core, other than the graph processing core that requested the corresponding workload offload, among other examples. In some cases, offloaded functions may be blocking functions, while in other instances, the offloaded functions may be performed in parallel with other functions performed by the requesting graph processing core. Indeed, the platform utilizing both graph processing cores and dense compute cores may be highly flexible in its applications and the programs that may be crafted to perform various graph-based algorithms in efficient and optimized ways.

As noted above, a DOQ may monitor 1065 the status of the control pipeline and determine 1090 when the control pipeline has completed directing the performance of a given workload and is available to be reactivated again to launch a next workload associated with a next offload instruction in the DOQ queue. Indeed, upon identifying the availability of the control pipeline 810, the DOQ 804 may pop the next instruction off the queue and repeat the flow by again launching functions associated with this next instruction and so on in accordance with the programs being run on the platform.

Distributed Simulation System

System simulations may typically involve functional and/or timing simulations of the system. A functional simulation may refer to the simulation of a design for hardware logic function(s) of a system, for example, to test the functionality of the logic circuits within one of the systems described above (e.g., those in FIGS. 6-7). A functional simulation may include the execution of source code for the logic function(s) on a computing platform that is not the target platform (e.g., one of the systems described above with respect to FIGS. 6-7). That is, a functional simulator may model, on an x86-based architecture, the execution of an application written to be executed on a different architecture implementing a different ISA (i.e., with a different ISA than an x86 ISA). For instance, a functional simulator may simulate how software threads go through their respective executions, what data values are during execution of the software, what branches are taken during execution, etc. A timing simulation may refer to a simulation of the timing needed for logic function(s) to complete. Thus, as opposed to functional simulations, timing simulations account for delays and latencies in the hardware logic based on the logic design. The timing simulation may, for instance, determine a number of cycles needed to perform various instructions on a simulated architecture. By performing both functional and timing simulations, a designer may be able to model the overall performance of a hardware architecture.

Traditional system simulators may use a functional-timing integrated paradigm, which can be more complex to develop and maintain. Functional-first setups, wherein the functional and timing models run on a single computing node, cannot scale out to model large, datacenter-level systems such as those described above (e.g., with respect to FIG. 7). Moreover, functional-first setups with tight timing feedback will run very slowly when using multiple computing nodes to perform the simulation.

Simulation systems may be used to model the graph processing architectures described above and their operation. However, in some instances, such an architecture may include thousands of threads and multiple terabytes (TB) of memory capacity, and simulation of such an architecture will not be possible on a single computing node.

Accordingly, in embodiments herein, the functional and performance models of the simulation may be distributed across multiple different computing nodes, with the functional and timing simulators being split. The separate functional and timing simulators may communicate through a Message Passing Interface (MPI) protocol (e.g., Intel® MPI or Open MPI). Each computing node implementing a simulator may be implemented using one or more of the architectures described further below. In certain embodiments, the functional simulator can span multiple computing nodes to take advantage of the combined memory capacity of the multiple nodes, with the timing simulator running on another separate computing node. The timing simulator may receive streams of dynamic instructions from the one or more functional simulators and may provide relaxed timing feedback to the functional simulators. In certain embodiments, the timing simulator may receive a set of instructions from each of the functional simulators before sending timing information back to the functional simulator (as opposed to providing timing information for one instruction at a time). The functional simulators may use the timing information from the timing simulator, e.g., to rebalance thread load distribution, to synchronize execution of multiple threads, to wait to execute an instruction until another instruction has “completed” execution, etc.

This simulation methodology may allow for the simulation of large graph processing architectures, such as systems utilizing the components and principles described above that span multiple nodes. Additionally, this simulation methodology may allow for making and validating design decisions of network architectures that implement the graph processing architectures described herein, among other example applications and advantages. For instance, example graph processing systems may utilize an architecture specialized at running graph analytics workloads and may have a very large memory capacity (e.g., 100+ TB), high processor counts, and high-performance networks that interconnect all of the processors and memories together. In the full system, a massive number (e.g., 10,000+) of simultaneous execution threads may exist, running on a combination of high-performance single-threaded processors (e.g., STCs described above), highly-threaded processors (e.g., MTCs described above), and programmable accelerators (e.g., DMA units or other offload engines), among other example systems. Traditional simulators may not be able to simulate such systems, while the simulation systems described herein may be able to.

FIG. 11 illustrates an example system simulator 1100 with separate timing and functional simulators in accordance with one or more embodiments. In the example shown, there are multiple host computing nodes 1102A-N connected over a network 1120. Each host computing node 1102 may be implemented as a bare-metal machine, or as a virtual machine or container executing on an underlying bare-metal platform. The network 1120 may be implemented as an Ethernet-based network, an Infiniband™-based network, or another suitable type of network. In some embodiments, the network 1120 may implement a TCP/IP protocol for communication between the various computing nodes 1102. For instance, in certain embodiments, the timing simulator 1112 may maintain a set of two-way channels over the network 1120 with each of the respective functional simulators 1114 (i.e., one for each graph processing architecture component being simulated by the functional simulator (e.g., a core or accelerator)), which may be implemented as TCP sockets.

Each computing node executes a simulator, with the computing node 1102A executing a timing simulator 1112 and computing nodes 1102B-N executing respective functional simulators 1114. The system simulator 1100 may simulate the functionality and timing of a graph processing architecture, such as the ones described above, e.g., with respect to FIGS. 6-7. For instance, using a distributed simulation system as described herein, the execution of a full relevant graph kernel on a multi-node graph processing system may be effectively simulated to obtain functionality and timing information.

In certain embodiments, the functional simulators 1114 may simulate the functionality of one or more cores of a graph processing architecture as disclosed herein, e.g., the STCs and MTCs described above. The functional simulator 1114 running on each computing platform 1102 may simulate multiple cores (and thus multiple threads) of the graph processing system being simulated. For instance, in some embodiments, each functional simulator 1114 may simulate a few hundred STCs 220, MTCs 215, or a combination thereof of a graph processing system. As an example, a graph processing system consisting of 512 STCs 220, 1,024 MTCs 215, and 16,896 overall threads, can be simulated on 33 host computing nodes—32 nodes to implement the functional simulator, and the remaining computing node to implement the timing simulator—at more than 10,000 instructions per second. The computing nodes to implement the functional and timing simulators may be implemented using one or more of the computing architectures as described further below.
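The thread and node counts in the example above can be cross-checked with simple arithmetic. The sketch below assumes one thread per STC and sixteen threads per MTC; these per-core thread counts are not stated in this passage and are chosen here only because they reproduce the quoted total. All names are hypothetical.

```python
STC_COUNT = 512          # from the text
MTC_COUNT = 1024         # from the text
THREADS_PER_STC = 1      # assumption: single-threaded cores
THREADS_PER_MTC = 16     # assumption: chosen to match the quoted total
FUNCTIONAL_NODES = 32    # from the text
TIMING_NODES = 1         # from the text

total_threads = STC_COUNT * THREADS_PER_STC + MTC_COUNT * THREADS_PER_MTC
total_nodes = FUNCTIONAL_NODES + TIMING_NODES
threads_per_functional_node = total_threads // FUNCTIONAL_NODES

print(total_threads, total_nodes, threads_per_functional_node)
```

With these assumptions the total is 16,896 threads on 33 host nodes, so each functional simulator node would model 528 threads if the load were divided evenly.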

The functional simulators 1114 may run applications written in the ISA of the graph processing architecture and can communicate with one another using MPI-based communications over the network 1120 (e.g., to synchronize execution). The timing simulator 1112 can simulate the timing of the ISA instructions executed by the functional simulators 1114 and may communicate with the functional simulators 1114 over the network 1120 to provide timing information and/or exchange synchronization information as described further below. In some embodiments, the timing simulator 1112 may implement a “functional-first with timing feedback” paradigm, whereby each execution thread simulated by functional simulator 1114 (simulating a large core, threaded core, or accelerator of a graph processing architecture) sends a dynamic instruction stream over an inter-process communication channel (e.g., UNIX pipe, TCP socket, etc.) to the timing simulator 1112, and thereafter receives feedback from the timing simulator 1112 indicating a duration of each instruction, e.g., in a number of cycles. In this way, the different threads of execution in the functional side of the simulator remain synchronized, and thus, timing-dependent behavior may be modeled correctly. In some embodiments, this may include explicit synchronization operations, e.g., waiting on an accelerator to finish its task, the effects of load imbalance when threads see different performance effects due to cache misses or communication to remote memory, etc.

By splitting the functional simulator across multiple simulation nodes (1102B-N), the simulator 1100 may be able to handle the massive thread count and the requisite memory capacity of a full graph processing system implemented using the architecture discussed herein (e.g., with respect to FIGS. 6-7). The timing simulator 1112 may require far less memory than the graph processing system, as it only needs to store the timing-relevant microarchitectural state of the system, and the simulated system may have small caches. Further, since most timing simulators are not concerned about data values, the footprint of the timing simulator 1112 may be much smaller than the memory capacity of the simulated machine. On the other hand, the timing simulation may also require fine-grained interactions (e.g., cores and memory may communicate with each other at nano-second time scales) so splitting the timing simulation across multiple nodes might create a synchronization bottleneck. Accordingly, in some embodiments, the timing simulator 1112 may run as a single multi-threaded application on a single computing node (e.g., on 1102A only, as shown in FIG. 11).

FIG. 12 illustrates an example instruction communication flow between functional simulators 1214 and a timing simulator 1210 in accordance with one or more embodiments. In the example shown, the simulator system 1200 includes functional simulators 1214 that are simulating execution of instruction sets (e.g., 1220A-C) of a single thread 1216 each (though they may simulate execution of thousands of other threads in certain embodiments). In certain embodiments, the functional simulators may execute on separate computing nodes as described with respect to FIG. 11; however, the aspects described below may be applicable for threads 1216 executing on the same functional simulator 1214 of a computing node as well. The simulator system 1200 also includes a timing simulator 1210 that executes on a separate computing node from the functional simulators, and runs different timing simulator instances 1212A-B that correspond to respective threads 1216A-B of the functional simulators 1214A-B.

Traditionally, functional-first simulators that employ timing feedback maintain tight synchronization between the functional and timing simulators and provide timing feedback for every instruction or every basic block. In cases where the feedback loop is carried across a TCP connection between different or multiple machines, even when using a low-latency network such as an Infiniband™-based network, waiting for a network round-trip after every simulated instruction results in an unacceptable slowdown of the simulation. Thus, in embodiments herein, the instructions themselves, or instruction information (e.g., opcodes, data addresses associated with the opcodes, and/or the data values for the opcodes), may be sent to the timing simulator and the timing simulator may simulate a set of instructions in a batched manner, while providing global synchronization points (e.g., 1230) for the entire simulator system 1200.

For instance, in the example shown in FIG. 12, each functional simulator executes a set of N instructions (e.g., 1220A, 1220B) and sends the instructions or information associated with the instructions to the timing simulator 1210. This may be done via a connector module in the functional simulator that intercepts the dynamic instruction stream from each component being simulated and sends the stream over the network to its corresponding timing simulator instance (e.g., 1212A). The sending of the instructions or the instruction information may be done asynchronously so the functional simulator can execute a plurality of instructions and enqueue them to the timing simulator without waiting.

Periodically, or after a certain number N of instructions, a synchronization marker (e.g., 1222A) is sent to the timing simulator 1210. The number N may be on the order of 100s of instructions or 1000s of instructions, in certain embodiments. The synchronization marker 1222 may be a synchronous request sent over the communication channel between the functional and timing simulators. After sending the synchronization marker 1222 to the timing simulator instance (e.g., 1212A), the functional simulator (e.g., 1214A) or the thread of the functional simulator (e.g., 1216A) that sent the synchronization marker 1222 may stall further execution of instructions, ensuring instructions are not executed at a higher rate than what the timing simulator indicates is allowable based on the set of instructions, memory characteristics of the simulated system, or other reasons.

In the timing simulator 1210, all threads of execution (e.g., from high-performance cores, multi-threaded cores, and accelerators) consume the instruction/instruction information sent by the functional simulators 1214. The timing simulator 1210 determines how much time (e.g., in cycles) is required to execute the sets of instructions (e.g., based on instruction latencies and dependencies, cache hits/misses, memory and network latencies and bandwidth congestion, etc.). The timing simulator 1210 may determine a number of cycles based on the instruction or instruction information provided by the functional simulator. In embodiments where instruction information is sent, the instruction information sent may be dependent on the type of instruction to be simulated, as the timing of different types of instructions may be based on different information. For example, where an instruction is a math operation, the timing simulator would need to know which type of math operation is being simulated (e.g., addition or multiplication) as the different types may require different numbers of cycles to execute. As another example, where there is a load or store operation to be simulated, the timing simulator 1210 may need the data address on which to operate, as it may simulate cache hits/misses, etc.
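As a minimal sketch of why the instruction information depends on the instruction type, the following toy cost model charges a fixed latency per math opcode and uses the data address of a load or store to model a cache hit or miss. The class and field names, the latency values, and the unbounded set-based cache are all assumptions made for illustration; they are not taken from the embodiments described herein.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical latencies in cycles; real values depend on the simulated design.
OPCODE_LATENCY = {"add": 1, "mul": 3}
CACHE_HIT_CYCLES = 2
CACHE_MISS_CYCLES = 100
LINE_BYTES = 64  # assumed cache line size


@dataclass
class InstrInfo:
    opcode: str
    address: Optional[int] = None  # only needed for loads and stores


class ToyTimingModel:
    """Toy model: math ops cost a fixed latency, memory ops depend on the cache."""

    def __init__(self):
        self.cached_lines = set()  # unbounded cache, for illustration only

    def cycles_for(self, instr):
        if instr.opcode in ("load", "store"):
            if instr.address is None:
                raise ValueError("memory operations need a data address")
            line = instr.address // LINE_BYTES
            hit = line in self.cached_lines
            self.cached_lines.add(line)
            return CACHE_HIT_CYCLES if hit else CACHE_MISS_CYCLES
        return OPCODE_LATENCY[instr.opcode]

    def cycles_for_batch(self, batch):
        # Dependencies and overlap are ignored; latencies are simply summed.
        return sum(self.cycles_for(i) for i in batch)
```

In this sketch, two loads to the same cache line cost one miss followed by one hit, which is why the functional simulator would need to send the address rather than only the opcode.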

The timing simulator 1210 may track a number of cycles required for each set of instructions (e.g., 1220A, 1220B) provided by each thread to execute, and may institute global synchronization points (e.g., 1230) to prevent execution of instructions on the functional simulators from occurring too quickly. This may allow for the many simulated threads to maintain relative synchronization with one another during the simulation.

For instance, in the example shown in FIG. 12, the thread 1216A may take X number of cycles to execute its first set of instructions 1220A, while the thread 1216B may take Y number of cycles to execute its first set of instructions 1220B, where Y>X (e.g., Y=2X). Because it takes more cycles for the thread 1216B to execute the same number of instructions, the thread 1216B will be ahead in time relative to the thread 1216A after both execute the same number of instructions. The global synchronization point 1230 may prevent the thread 1216A from executing any further instructions, allowing it to catch up in time to the thread 1216B.

The global synchronization points (e.g., 1230) may be determined by the timing simulator 1210 as it tracks timing information related to execution of the various sets of instructions (e.g., 1220) provided by the threads (e.g., 1216). As shown in FIG. 12, the timing simulator 1210 may wait to send timing information back to the functional simulators 1214 until the global synchronization point 1230 has been reached. The threads may then resume their functional simulations of the next set of N instructions (e.g., 1220C) after receiving the timing information back from the timing simulator, as shown in FIG. 12.

The global synchronization point 1230 may be determined by tracking timestamps for each simulated thread. For instance, in some embodiments, the timing simulator 1210 may maintain a local timestamp (e.g., 1218A-B) associated with each executing thread (e.g., 1216A-B), and may increment each thread's local timestamp based on the timing simulation of the set of instructions provided by a thread. Whenever a synchronization message is received from a thread of a functional simulator (e.g., 1216A), the timing simulator corresponding to that component (e.g., 1212A) may compare the local timestamp of that thread to a current global timestamp 1219 maintained by the timing simulator 1210, which may serve as a minimum value of the local timestamps of all participating components simulated by functional simulators 1214. If a component's local timestamp is ahead of the global timestamp by more than a threshold value, the component may be stalled from further execution on the functional simulator. Once the global timestamp has progressed far enough (i.e., beyond the local timestamp for the component), the timing simulator may send the timing information back to the functional simulator, allowing it to continue simulation of the next set of instructions.
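The timestamp comparison described above can be sketched as follows. This is an illustrative model only: the class name, the method names, and the choice to use the same threshold for both stalling and releasing a thread are assumptions, and the cycle counts in the usage example are made up.

```python
class GlobalSync:
    """Tracks per-thread local timestamps and a global (minimum) timestamp."""

    def __init__(self, thread_ids, threshold=0):
        self.local = {tid: 0 for tid in thread_ids}  # local timestamps, in cycles
        self.threshold = threshold                   # allowed lead over the global timestamp

    @property
    def global_ts(self):
        # The global timestamp is the minimum of all local timestamps.
        return min(self.local.values())

    def advance(self, tid, cycles):
        # Called after the timing simulation of a thread's instruction set.
        self.local[tid] += cycles

    def must_stall(self, tid):
        # A thread stalls while it is too far ahead of the slowest participant.
        return self.local[tid] - self.global_ts > self.threshold
```

For example, with a threshold of 50 cycles, a thread whose first set took 200 cycles is held back while another thread is still at 100 cycles, and is released once that other thread reaches 200 cycles:

```python
sync = GlobalSync(["1216A", "1216B"], threshold=50)
sync.advance("1216A", 100)
sync.advance("1216B", 200)
print(sync.must_stall("1216B"))  # True: 200 - 100 exceeds the threshold
sync.advance("1216A", 100)
print(sync.must_stall("1216B"))  # False: the global timestamp has caught up
```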

The points at which synchronization markers/messages are inserted into the dynamic instruction stream by the functional simulators may be determined based on the type of component being simulated by the functional simulator, which may allow for maximizing accuracy while maintaining a simulation speed as high as possible in those cases where exact synchronization is not required.

For example, in some embodiments, where a functional simulator is simulating a single threaded core (e.g., STCs 220 of FIG. 2B) as shown in FIG. 12, the synchronization marker (e.g., 1222) may be sent after a predetermined number of instructions (e.g., 1220), after which synchronization may be enforced, requiring a TCP round-trip. In some cases, the predetermined number may be on the order of hundreds (e.g., 100 or 200) to thousands of instructions.

Another example is shown in FIG. 13, which illustrates an example instruction communication flow between a multi-threaded functional simulator 1314 and a timing simulator 1312 in accordance with one or more embodiments. In the example shown, the functional simulator 1314 is simulating a multi-threaded core (e.g., MTCs 215 of FIG. 2B) and interleaves instructions (represented by the vertical bars) between each of the actively running threads 1316A-D. The interleaving may be performed up to a predetermined number of total instructions, which may be equal to the single threaded core synchronization parameter described above (which may be on the order of 100s to 1000s of instructions). Once the instruction count for all threads 1316 has reached the predetermined number, a single synchronization message may be sent to the timing simulator 1312.

The synchronization message may include the queue state for each thread running on the functional simulator 1314. This way, if the threads 1316 running on a single multi-threaded processor are not making progress at the same rate, the interleaving factor for the following interval (i.e., the time between synchronization markers/messages) can be adapted in such a way that slow threads (determined by the timing simulator 1312) run fewer instructions on the functional simulator 1314.

In some embodiments, the timing simulator 1312 may send a per-thread interleaving factor indicating a number of instructions per cycle (IPC) for each thread 1316 of the functional simulator 1314 as shown in FIG. 13. For example, as shown, the functional simulator 1314 in the first time interval shown in FIG. 13 executes two instructions for each thread 1316 in a round robin manner (in some embodiments, the functional simulator may default to an equal instruction distribution between threads and adjust from there based on the timing information received from the timing simulator). After receiving the timing and/or per-thread IPC information from the timing simulator 1312, the functional simulator 1314 in the second time interval shown in FIG. 13 executes three instructions in the thread 1316A, one instruction in the thread 1316B, four instructions in the thread 1316C, and zero instructions in the thread 1316D, in a round robin manner. The timing simulator 1312 may determine the thread interleaving factor based on timing information determined from a previous set of instructions. For instance, in the example shown, the interleaving factor may have defaulted to 2 instructions per cycle in the first set of instructions 1320A executed by the functional simulator 1314, and the timing simulator 1312 may have determined that thread 1316A is executing approximately three times as fast as thread 1316B. Thus, the timing simulator may provide the 3:1 instruction ratio in the interleaving factor sent for the second set of instructions 1320B.
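The round-robin interleaving in the two intervals of FIG. 13 can be illustrated with a short sketch. The function name and the dictionary representation of the per-thread interleaving factor are assumptions for illustration; how the timing simulator derives the factors is not modeled here.

```python
def interleave(factors, rounds=1):
    """Return the thread issue order for round-robin interleaving.

    `factors` maps a thread id to the number of instructions that thread
    issues per round-robin pass (the hypothetical interleaving factor).
    """
    order = []
    for _ in range(rounds):
        for tid, count in factors.items():
            order.extend([tid] * count)  # a factor of zero skips the thread
    return order


# First interval: default equal distribution of two instructions per thread.
first = interleave({"A": 2, "B": 2, "C": 2, "D": 2})
# Second interval: factors adjusted to 3, 1, 4, and 0 after timing feedback.
second = interleave({"A": 3, "B": 1, "C": 4, "D": 0})
print("".join(first))   # AABBCCDD
print("".join(second))  # AAABCCCC
```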

As another example, where an accelerator function is to be simulated by a functional simulator, the accelerator function (e.g., a large data transfer) may be simulated as many separate instructions (e.g., instructions at the cache line granularity). This is because accelerators can do a large amount of work per instruction. For example, a single DMA instruction can initiate the transfer of MBs of memory, which can involve many different sub-instructions (e.g., load and store operations) for the single DMA instruction. In this example, the different load and store operations may be counted toward the predetermined number of instructions. Each simulated accelerator can have its own communication channel with the timing simulator 1312, and the functional simulator for the accelerator can send each operation through the channel in a way similar to how dynamic instructions are sent for a core being simulated and can add synchronization markers after a set number of accelerator operations (e.g., 1 or a few accelerator operations, since such operations may translate to many different underlying operations).
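As a rough illustration of how a single accelerator operation may translate into many underlying operations, the sketch below expands a DMA transfer into load and store operations at cache line granularity. The function name, the tuple format, and the 64-byte line size are assumptions, not details of the described embodiments.

```python
LINE_BYTES = 64  # assumed cache line size


def expand_dma(src, dst, nbytes, line_bytes=LINE_BYTES):
    """Expand one DMA transfer into per-cache-line (operation, address) pairs."""
    ops = []
    for offset in range(0, nbytes, line_bytes):
        ops.append(("load", src + offset))   # read one line from the source
        ops.append(("store", dst + offset))  # write it to the destination
    return ops


# A 1 MiB transfer becomes 16384 loads and 16384 stores.
print(len(expand_dma(0x0, 0x1000_0000, 1 << 20)))  # 32768
```

Under these assumptions, one DMA instruction already exceeds a synchronization interval of hundreds or thousands of operations, which is consistent with inserting a synchronization marker after only one or a few accelerator operations.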

In certain embodiments, there may be certain types of instructions that can cause a forced synchronization to occur. For instance, where a functional simulator may be configured to send a synchronization message after 100 simulated instructions, if a particular type of instruction is issued after 10 instructions (as an example), the particular type of instruction may force synchronization of the functional simulator regardless of where the current instruction count (as described above) stands for the simulator. This includes instructions related to explicit communication between cores and accelerators (e.g., a wait-for-DMA instruction, an atomic operation, an enqueue/dequeue operation, memory scatter instruction (e.g., in a memory engine), etc.).

FIG. 14 illustrates a flow diagram of an example simulation process 1400 implemented by a functional simulator in accordance with one or more embodiments. The example process 1400 may include additional or different operations, and the operations may be performed in the order shown or in another order. In some cases, one or more of the operations shown in FIG. 14 may be implemented as processes that include multiple operations, sub-processes, or other types of routines. In some cases, operations can be combined, performed in another order, performed in parallel, iterated, or otherwise repeated or performed in another manner. In certain embodiments, the operations may be encoded as instructions (e.g., software instructions) that are executable by one or more processors (e.g., processors of the computing node executing the functional simulator) and stored on computer-readable media.

At 1402, the functional simulator (e.g., 1114, 1214, 1314) simulates an instruction (e.g., 1220) for a component of a graph processing system (e.g., an STC, MTC, or accelerator), and at 1404, it sends information associated with the instruction to a timing simulator (e.g., 1112, 1212, 1312) over a network (e.g., 1120). The instruction information may be the instructions themselves, or may be one or more of instruction opcodes, data addresses associated with the opcodes, and/or the data values for the opcodes. The functional simulator then determines whether the instruction is a special instruction that requires a synchronization message to be sent (1406) or whether a predetermined number of instructions (N) have been simulated already (1408). The special type of instruction for purposes of the determination at 1406 may include a wait-for-DMA instruction, an atomic operation, an enqueue/dequeue operation, or a memory scatter instruction. If the instruction simulated at 1402 is one of the special instruction types, or if N instructions have been simulated already, then the functional simulator sends a synchronization message to the timing simulator at 1410 and stalls the simulation/execution of any further instructions until it receives timing information back from the timing simulator at 1412. The timing information may include a number of cycles the timing simulator models as being required for the execution of the set of instructions. Where a multi-threaded core is being simulated by the functional simulator, the timing information may also include a thread interleaving factor, and the functional simulator may modify its execution of further instructions to comply with the interleaving factor provided (e.g., at 1416).
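The decision logic at 1406 and 1408 can be sketched as a simple loop that records what the functional simulator would place on its communication channel. The opcode names, the function name, and the choice to reset the instruction counter after a forced synchronization are assumptions for illustration; the network send and the blocking wait at 1412 are replaced by list appends.

```python
# Hypothetical opcode names for the instruction types that force synchronization.
SPECIAL_OPCODES = {"wait_for_dma", "atomic", "enqueue", "dequeue", "scatter"}


def trace_functional(opcodes, n):
    """Return the ordered events a functional simulator would send.

    `n` is the predetermined number of instructions between synchronization
    markers; a special opcode forces a marker regardless of the count.
    """
    events = []
    since_sync = 0
    for opcode in opcodes:
        events.append(opcode)  # 1402/1404: simulate, then send asynchronously
        since_sync += 1
        if opcode in SPECIAL_OPCODES or since_sync >= n:  # 1406 or 1408
            # 1410: send the marker; a real simulator would stall here
            # until timing information is received (1412).
            events.append("SYNC")
            since_sync = 0  # assumed: the count restarts after any sync
    return events


print(trace_functional(["add", "add", "atomic", "add", "add", "add", "add"], n=4))
# ['add', 'add', 'atomic', 'SYNC', 'add', 'add', 'add', 'add', 'SYNC']
```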

FIG. 15 illustrates a flow diagram of an example simulation process 1500 implemented by a timing simulator in accordance with one or more embodiments. The example process 1500 may include additional or different operations, and the operations may be performed in the order shown or in another order. In some cases, one or more of the operations shown in FIG. 15 may be implemented as processes that include multiple operations, sub-processes, or other types of routines. In some cases, operations can be combined, performed in another order, performed in parallel, iterated, or otherwise repeated or performed in another manner. In certain embodiments, the operations may be encoded as instructions (e.g., software instructions) that are executable by one or more processors (e.g., processors of the computing node executing the timing simulator) and stored on computer-readable media.

At 1502, the timing simulator (e.g., 1112, 1212, 1312) receives instruction information from a plurality of functional simulators (e.g., 1114, 1214, 1314). The instruction information may be the instructions themselves, or may be one or more of instruction opcodes, data addresses associated with the opcodes, and/or the data values for the opcodes. At 1504, the timing simulator determines a number of cycles required for execution of the instructions provided at 1502, and in some embodiments, may determine a thread interleaving factor for multi-threaded cores being simulated (1506). At 1508, the timing simulator determines a global synchronization point for the functional simulators, which may be based on a comparison between local timestamps associated with each thread being simulated on the functional simulators and a global timestamp maintained by the timing simulator (e.g., as described above). The timing simulator then sends the timing information back to the respective functional simulators at the global synchronization point at 1510, after which the functional simulators may continue their simulations.

Example Computer Architectures

FIGS. 16-19 are block diagrams of example computer architectures. Other system designs and configurations known in the arts for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, micro controllers, cell phones, portable media players, handheld devices, and various other electronic devices, are also suitable. In general, a huge variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.

Referring now to FIG. 16, shown is a block diagram of a system 1600 in accordance with one embodiment of the present disclosure. The system 1600 may include one or more processors 1610, 1615, which are coupled to a controller hub 1620. In one embodiment the controller hub 1620 includes a graphics memory controller hub (GMCH) 1690 and an Input/Output Hub (IOH) 1650 (which may be on separate chips); the GMCH 1690 includes memory and graphics controllers to which are coupled memory 1640 and a coprocessor 1645; the IOH 1650 couples input/output (I/O) devices 1660 to the GMCH 1690. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1640 and the coprocessor 1645 are coupled directly to the processor 1610, and the controller hub 1620 is in a single chip with the IOH 1650. Memory 1640 may include application code 1640A, for example, to store code that when executed causes a processor to perform any method of this disclosure.

The optional nature of additional processors 1615 is denoted in FIG. 16 with broken lines. Each processor 1610, 1615 may include one or more of the processing cores described herein and may be some version of the processor 2000.

The memory 1640 may be, for example, dynamic random-access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1620 communicates with the processor(s) 1610, 1615 via a multi-drop bus, such as a frontside bus (FSB), point-to-point interface such as Quickpath Interconnect (QPI), or similar connection 1695.

In one embodiment, the coprocessor 1645 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1620 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 1610, 1615 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.

In one embodiment, the processor 1610 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1610 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1645. Accordingly, the processor 1610 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 1645. Coprocessor(s) 1645 accept and execute the received coprocessor instructions.

Referring now to FIG. 17, shown is a block diagram of a first more specific example system 1700 in accordance with an embodiment of the present disclosure. As shown in FIG. 17, multiprocessor system 1700 is a point-to-point interconnect system, and includes a first processor 1770 and a second processor 1780 coupled via a point-to-point interconnect 1750. Each of processors 1770 and 1780 may be some version of the processor 2000. In one embodiment of the disclosure, processors 1770 and 1780 are respectively processors 1610 and 1615, while coprocessor 1738 is coprocessor 1645. In another embodiment, processors 1770 and 1780 are respectively processor 1610 and coprocessor 1645.

Processors 1770 and 1780 are shown including integrated memory controller (IMC) units 1772 and 1782, respectively. Processor 1770 also includes as part of its bus controller units point-to-point (P-P) interfaces 1776 and 1778; similarly, second processor 1780 includes P-P interfaces 1786 and 1788. Processors 1770, 1780 may exchange information via a point-to-point (P-P) interface 1750 using P-P interface circuits 1778, 1788. As shown in FIG. 17, IMCs 1772 and 1782 couple the processors to respective memories, namely a memory 1732 and a memory 1734, which may be portions of main memory locally attached to the respective processors.

Processors 1770, 1780 may each exchange information with a chipset 1790 via individual P-P interfaces 1752, 1754 using point to point interface circuits 1776, 1794, 1786, 1798. Chipset 1790 may optionally exchange information with the coprocessor 1738 via a high-performance interface 1739. In one embodiment, the coprocessor 1738 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.

A shared cache (not shown) may be included in either processor or outside of both processors, yet is connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

Chipset 1790 may be coupled to a first bus 1716 via an interface 1796. In one embodiment, first bus 1716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present disclosure is not so limited.

As shown in FIG. 17, various I/O devices 1714 may be coupled to first bus 1716, along with a bus bridge 1718 which couples first bus 1716 to a second bus 1720. In one embodiment, one or more additional processor(s) 1715, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to first bus 1716. In one embodiment, second bus 1720 may be a low pin count (LPC) bus. Various devices may be coupled to a second bus 1720 including, for example, a keyboard and/or mouse 1722, communication devices 1727 and a storage unit 1728 such as a disk drive or other mass storage device which may include instructions/code and data 1730, in one embodiment. Further, an audio I/O 1724 may be coupled to the second bus 1720. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 17, a system may implement a multi-drop bus or other such architecture.

Referring now to FIG. 18, shown is a block diagram of a second more specific example system 1800 in accordance with an embodiment of the present disclosure. Like elements in FIGS. 17 and 18 bear like reference numerals, and certain aspects of FIG. 17 have been omitted from FIG. 18 in order to avoid obscuring other aspects of FIG. 18.

FIG. 18 illustrates that the processors 1770, 1780 may include integrated memory and I/O control logic (“CL”) 1772 and 1782, respectively. Thus, the CL 1772, 1782 include integrated memory controller units and include I/O control logic. FIG. 18 illustrates that not only are the memories 1732, 1734 coupled to the CL 1772, 1782, but also that I/O devices 1814 are also coupled to the control logic 1772, 1782. Legacy I/O devices 1815 are coupled to the chipset 1790.

Referring now to FIG. 19, shown is a block diagram of a SoC 1900 in accordance with an embodiment of the present disclosure. Similar elements in FIG. 20 bear like reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In FIG. 19, an interconnect unit(s) 1902 is coupled to: an application processor 1910 which includes a set of one or more cores 2002A-N and shared cache unit(s) 2006; a system agent unit 2010; a bus controller unit(s) 2016; an integrated memory controller unit(s) 2014; a set of one or more coprocessors 1920 which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1930; a direct memory access (DMA) unit 1932; and a display unit 1940 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 1920 include a special-purpose processor, such as, for example, a network or communication processor, compression engine, GPGPU, a high-throughput MIC processor, embedded processor, or the like.

Embodiments (e.g., of the mechanisms) disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the disclosure may be implemented as computer programs or program code executing on programmable systems including at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code, such as code 1730 illustrated in FIG. 17, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the disclosure also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.

FIG. 20 is a block diagram of a processor 2000 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the disclosure. The solid lined boxes in FIG. 20 illustrate a processor 2000 with a single core 2002A, a system agent 2010, a set of one or more bus controller units 2016, while the optional addition of the dashed lined boxes illustrates an alternative processor 2000 with multiple cores 2002A-N, a set of one or more integrated memory controller unit(s) 2014 in the system agent unit 2010, and special purpose logic 2008.

Thus, different implementations of the processor 2000 may include: 1) a CPU with the special purpose logic 2008 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 2002A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 2002A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput); and 3) a coprocessor with the cores 2002A-N being a large number of general purpose in-order cores. Thus, the processor 2000 may be a general-purpose processor, coprocessor, or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 2000 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 2006, and external memory (not shown) coupled to the set of integrated memory controller units 2014. The set of shared cache units 2006 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 2012 interconnects the integrated graphics logic 2008, the set of shared cache units 2006, and the system agent unit 2010/integrated memory controller unit(s) 2014, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 2006 and cores 2002A-N.

In some embodiments, one or more of the cores 2002A-N are capable of multi-threading. The system agent 2010 includes those components coordinating and operating cores 2002A-N. The system agent unit 2010 may include for example a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 2002A-N and the integrated graphics logic 2008. The display unit is for driving one or more externally connected displays.

The cores 2002A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 2002A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

Emulation (Including Binary Translation, Code Morphing, Etc.)

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.

FIG. 21 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the disclosure. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 21 shows a program in a high-level language 2102 may be compiled using an x86 compiler 2104 to generate x86 binary code 2106 that may be natively executed by a processor with at least one x86 instruction set core 2116. The processor with at least one x86 instruction set core 2116 represents any processor that can perform substantially the same functions as an Intel® processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel® x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel® processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel® processor with at least one x86 instruction set core. The x86 compiler 2104 represents a compiler that is operable to generate x86 binary code 2106 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 2116. Similarly, FIG. 21 shows the program in the high level language 2102 may be compiled using an alternative instruction set compiler 2108 to generate alternative instruction set binary code 2110 that may be natively executed by a processor without at least one x86 instruction set core 2114 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif. and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, Calif.). The instruction converter 2112 is used to convert the x86 binary code 2106 into code that may be natively executed by the processor without an x86 instruction set core 2114. This converted code is not likely to be the same as the alternative instruction set binary code 2110 because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 2112 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 2106.
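As a purely illustrative sketch, and not a description of the instruction converter 2112 itself, the conversion of one source instruction into one or more target instructions may be pictured as a table-driven static translation. The source opcodes (INC, MOV, PUSH), the target opcodes (ADDI, STORE), the register names, and the functions convert and convert_program are all hypothetical and do not correspond to any real instruction set.

```python
# Hypothetical source-to-target translation table (assumption: neither
# instruction set is real). Each entry maps a source opcode to a function
# that emits one or more target instructions.
TRANSLATION_TABLE = {
    "INC": lambda dst: [("ADDI", dst, dst, 1)],
    "MOV": lambda dst, src: [("ADDI", dst, src, 0)],
    # One source instruction may expand into several target instructions.
    "PUSH": lambda src: [("ADDI", "sp", "sp", -4), ("STORE", src, "sp", 0)],
}


def convert(instruction):
    """Convert one source instruction tuple into a list of target instructions."""
    opcode, *operands = instruction
    try:
        emit = TRANSLATION_TABLE[opcode]
    except KeyError:
        raise ValueError(f"no translation for source opcode {opcode!r}") from None
    return emit(*operands)


def convert_program(program):
    """Statically translate a whole program, preserving instruction order."""
    return [target for instruction in program for target in convert(instruction)]
```

For instance, convert(("PUSH", "r1")) yields two target instructions, which illustrates why converted code is generally not identical to code that a native compiler for the target instruction set would produce.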

“Logic” (e.g., as found in offload engines, memory managers, memory controllers, network controllers, etc. and other references to logic in this application) may refer to hardware, firmware, software and/or combinations of each to perform one or more functions. In various embodiments, logic may include a microprocessor or other processing element operable to execute software instructions, discrete logic such as an application specific integrated circuit (ASIC), a programmed logic device such as a field programmable gate array (FPGA), a memory device containing instructions, combinations of logic devices (e.g., as would be found on a printed circuit board), or other suitable hardware and/or software. Logic may include one or more gates or other circuit components. In some embodiments, logic may also be fully embodied as software.

Use of the phrase ‘to’ or ‘configured to,’ in one embodiment, refers to arranging, putting together, manufacturing, offering to sell, importing and/or designing an apparatus, hardware, logic, or element to perform a designated or determined task. In this example, an apparatus or element thereof that is not operating is still ‘configured to’ perform a designated task if it is designed, coupled, and/or interconnected to perform said designated task. As a purely illustrative example, a logic gate may provide a 0 or a 1 during operation. But a logic gate ‘configured to’ provide an enable signal to a clock does not include every potential logic gate that may provide a 1 or 0. Instead, the logic gate is one coupled in some manner that during operation the 1 or 0 output is to enable the clock. Note once again that use of the term ‘configured to’ does not require operation, but instead focuses on the latent state of an apparatus, hardware, and/or element, where in the latent state the apparatus, hardware, and/or element is designed to perform a particular task when the apparatus, hardware, and/or element is operating.

Furthermore, use of the phrases ‘capable of/to,’ and/or ‘operable to,’ in one embodiment, refers to some apparatus, logic, hardware, and/or element designed in such a way to enable use of the apparatus, logic, hardware, and/or element in a specified manner. Note as above that use of to, capable to, or operable to, in one embodiment, refers to the latent state of an apparatus, logic, hardware, and/or element, where the apparatus, logic, hardware, and/or element is not operating but is designed in such a manner to enable use of an apparatus in a specified manner.

A value, as used herein, includes any known representation of a number, a state, a logical state, or a binary logical state. Often, the use of logic levels, logic values, or logical values is also referred to as 1's and 0's, which simply represents binary logic states. For example, a 1 refers to a high logic level and 0 refers to a low logic level. In one embodiment, a storage cell, such as a transistor or flash cell, may be capable of holding a single logical value or multiple logical values. However, other representations of values in computer systems have been used. For example, the decimal number ten may also be represented as a binary value of 1010 and a hexadecimal letter A. Therefore, a value includes any representation of information capable of being held in a computer system.
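As a brief illustration of equivalent representations, the following sketch formats the decimal number ten in binary and hexadecimal and confirms that both parse back to the same value. The variable names are chosen only for this illustration.

```python
value = 10

# The same value in two other representations.
binary = format(value, "b")       # binary digits, no prefix
hexadecimal = format(value, "X")  # uppercase hexadecimal digits, no prefix

# Both strings denote the same underlying value.
assert int(binary, 2) == value
assert int(hexadecimal, 16) == value
```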

Moreover, states may be represented by values or portions of values. As an example, a first value, such as a logical one, may represent a default or initial state, while a second value, such as a logical zero, may represent a non-default state. In addition, the terms reset and set, in one embodiment, refer to a default and an updated value or state, respectively. For example, a default value potentially includes a high logical value, i.e. reset, while an updated value potentially includes a low logical value, i.e. set. Note that any combination of values may be utilized to represent any number of states.

The embodiments of methods, hardware, software, firmware, or code set forth above may be implemented via instructions or code stored on a machine-accessible, machine readable, computer accessible, or computer readable medium which are executable by a processing element. A non-transitory machine-accessible/readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine, such as a computer or electronic system. For example, a non-transitory machine-accessible medium includes random-access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM); ROM; magnetic or optical storage medium; flash memory devices; electrical storage devices; optical storage devices; acoustical storage devices; other form of storage devices for holding information received from transitory (propagated) signals (e.g., carrier waves, infrared signals, digital signals); etc., which are to be distinguished from the non-transitory mediums that may receive information therefrom.

Instructions used to program logic to perform embodiments of the disclosure may be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, Compact Disc Read-Only Memories (CD-ROMs), magneto-optical disks, Read-Only Memories (ROMs), Random Access Memory (RAM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic or optical cards, flash memory, or a tangible, machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, the computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).

The following examples pertain to embodiments in accordance with this Specification.

Example 1 is a distributed simulation system comprising: a first computing node to execute a timing simulator of a graph processing system; and a set of second computing nodes coupled to the first computing node over a network, each second computing node to execute a functional simulator of the graph processing system; wherein: each functional simulator is to simulate execution of a set of instructions on the graph processing system and send information associated with the simulated set of instructions to the timing simulator over the network; and the timing simulator is to determine timing information associated with execution of the sets of instructions sent by the functional simulators, and send the timing information to the functional simulators over the network.

Example 2 includes the subject matter of Example 1, wherein at least one functional simulator is to execute a simulation of a component of the graph processing system, and to send instruction information to the timing simulator after a predetermined number of instructions have been simulated by the functional simulator for the component.

Example 3 includes the subject matter of Example 2, wherein the component is a single threaded core of the graph processing system.

Example 4 includes the subject matter of Example 2, wherein the component is a multi-threaded core of the graph processing system, and the functional simulator is to send instruction information to the timing simulator after a predetermined number of instructions have been simulated by the functional simulator for all of the threads of the multi-threaded core in aggregate.

Example 5 includes the subject matter of Example 4, wherein the timing simulator is to determine a thread interleaving factor for the multi-threaded core and send the interleaving factor to the functional simulator, the thread interleaving factor indicating a number of instructions to simulate for each thread of the multi-threaded core per cycle.

Example 6 includes the subject matter of Example 2, wherein the component is an accelerator of the graph processing system, and the functional simulator is to send instruction information to the timing simulator after a predetermined number of sub-instructions of an accelerator instruction have been executed.

Example 7 includes the subject matter of Example 2, wherein the functional simulator is to send the instruction information to the timing simulator, without waiting for the predetermined number of instructions to be simulated, based on one or more of a wait-for-DMA instruction, an atomic operation, an enqueue/dequeue operation, and a memory scatter instruction being simulated on the functional simulator.

Example 8 includes the subject matter of any one of Examples 1-7, wherein the timing information for each set of instructions includes a number of cycles required to execute the set of instructions on the graph processing system.

Example 9 includes the subject matter of any one of Examples 1-8, wherein: the timing simulator is to determine a global synchronization point for the functional simulators, and send the timing information for the sets of instructions to each respective functional simulator at the global synchronization point; and each functional simulator is to stall simulation of further instructions until the timing information for its set of instructions is received from the timing simulator.

Example 10 includes the subject matter of Example 9, wherein the timing simulator is to maintain a local timestamp for each thread being simulated on a functional simulator and a global timestamp, and to determine the global synchronization point based on comparisons between the local timestamps and the global timestamp.

Example 11 includes the subject matter of any one of Examples 1-10, wherein the instruction information includes instruction opcodes and data addresses associated with the opcodes.

Example 12 includes the subject matter of Example 11, wherein the instruction information further includes data values.

Example 13 includes the subject matter of any one of Examples 1-10, wherein the instruction information includes the instructions.

Example 14 includes the subject matter of any one of Examples 1-13, wherein the set of second computing nodes are to communicate with one another via a Message Passing Interface (MPI).

Example 14.5 includes the subject matter of any one of Examples 1-14, wherein the first computing node does not execute a functional simulator and the second computing nodes do not execute timing simulators.

Example 15 includes the subject matter of any one of Examples 1-14.5, wherein the network is an Ethernet-based network or Infiniband-based network.

Example 16 is a method comprising: at each of a plurality of functional simulators, simulating execution of instructions on components of a graph processing system and sending information associated with the simulated instructions to a timing simulator; and at a timing simulator, determining timing information associated with execution of the instructions sent by the functional simulators and sending the timing information to the functional simulators.

Example 17 includes the subject matter of Example 16, wherein the instruction information is sent to the timing simulator by the functional simulators after a predetermined number of instructions have been simulated by the functional simulator for a component.

Example 18 includes the subject matter of Example 17, wherein simulating execution of instructions comprises simulating a multi-threaded core of the graph processing system, and the method further comprises sending, by the functional simulator simulating the multi-threaded core, instruction information to the timing simulator after a predetermined number of instructions have been simulated for all of the threads of the multi-threaded core in aggregate.

Example 19 includes the subject matter of Example 18, further comprising determining, by the timing simulator, a thread interleaving factor for the multi-threaded core and sending, by the timing simulator, the interleaving factor to the functional simulator, the thread interleaving factor indicating a number of instructions to simulate for each thread of the multi-threaded core per cycle.

Example 20 includes the subject matter of any one of Examples 16-19, further comprising: determining, by the timing simulator, a global synchronization point for the functional simulators, and sending the timing information for the sets of instructions to each respective functional simulator at the global synchronization point; and at each functional simulator, stalling simulation of further instructions until the timing information is received from the timing simulator.

Example 21 includes the subject matter of Example 20, further comprising maintaining, by the timing simulator, a local timestamp for each thread being simulated on a functional simulator and a global timestamp, wherein the global synchronization point is determined based on comparisons between the local timestamps and the global timestamp.

Example 22 includes the subject matter of any one of Examples 16-21, wherein the instruction information includes instruction opcodes and data addresses associated with the opcodes.

Example 23 includes the subject matter of Example 22, wherein the instruction information further includes data values.

Example 24 includes the subject matter of any one of Examples 16-21, wherein the instruction information includes the instructions.

Example 25 includes the subject matter of any one of Examples 16-24, wherein the functional simulators and the timing simulator communicate with one another via a Message Passing Interface (MPI).

Example 26 includes the subject matter of any one of Examples 16-25, wherein the functional simulators and the timing simulator are connected to one another over an Ethernet-based network or an Infiniband-based network.

Example 27 includes one or more non-transitory computer-readable media comprising instructions that, when executed by one or more processors, cause the processors to implement any one of Examples 16-26.

Example 28 includes an apparatus comprising means to implement any one of Examples 16-26.

Example 29 includes one or more non-transitory computer-readable media comprising instructions that, when executed by processors of a distributed simulation system, cause the processors to: receive, from a plurality of functional simulators, information associated with the simulated execution of instructions on components of a graph processing system; determine timing information associated with execution of the instructions sent by the functional simulators; determine a global synchronization point for the functional simulators; and send the timing information for the sets of instructions to each respective functional simulator at the global synchronization point.

Example 30 includes the subject matter of Example 29, wherein the instructions are further to determine a thread interleaving factor for a multi-threaded core being simulated by a functional simulator and send the interleaving factor to the functional simulator, the thread interleaving factor indicating a quantity of instructions to simulate for each thread of the multi-threaded core per cycle.

Example 31 includes the subject matter of Example 29, wherein the instructions are to maintain a local timestamp for each thread being simulated on a functional simulator and a global timestamp, and wherein the global synchronization point is determined based on comparisons between the local timestamps and the global timestamp.

Example 32 includes one or more non-transitory computer-readable media comprising instructions that, when executed by processors of a distributed simulation system, cause the processors to: simulate execution of instructions on a component of a graph processing system; after a predetermined quantity of instructions have been simulated for the component, cause information associated with the simulated instructions to be sent to a timing simulator over a network; and stall simulation of further instructions until timing information is received from the timing simulator for the quantity of instructions.

Example 33 includes the subject matter of Example 32, wherein simulating execution of instructions comprises simulating a multi-threaded core of the graph processing system, and the instructions are further to send the instruction information to the timing simulator after a predetermined number of instructions have been simulated for all of the threads of the multi-threaded core in aggregate.

Example 34 includes the subject matter of Example 32, wherein the instruction information includes instruction opcodes and data addresses associated with the opcodes.
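As a non-limiting illustration of the exchange described in Examples 1, 2, 7, 9, and 10, the following minimal sketch models a functional simulator that batches instruction information, flushes early on certain operations, and stalls until timing information is returned, together with a timing simulator that releases timing information at a global synchronization point. Everything below is an assumption made for illustration: the opcode names, the per-opcode cycle costs, the single thread per functional simulator, the in-process method calls standing in for network messages, and in particular the rule used to detect the synchronization point (every local timestamp has advanced past the global timestamp). The disclosure states only that the point is determined based on comparisons between the local timestamps and the global timestamp.

```python
from dataclasses import dataclass, field

# Hypothetical opcodes that trigger a send without waiting for a full batch.
SYNC_OPCODES = {"wait_dma", "atomic", "enqueue", "dequeue", "scatter"}
# Hypothetical cycle costs; unknown opcodes default to one cycle.
CYCLE_COST = {"add": 1, "load": 4, "atomic": 8}


@dataclass
class FunctionalSimulator:
    sim_id: int
    batch_size: int  # the "predetermined number" of instructions
    batch: list = field(default_factory=list)
    stalled: bool = False

    def simulate(self, opcode, address):
        """Simulate one instruction; return a batch when one should be sent."""
        if self.stalled:
            raise RuntimeError("stalled until timing information arrives")
        self.batch.append((opcode, address))
        if len(self.batch) >= self.batch_size or opcode in SYNC_OPCODES:
            out, self.batch = self.batch, []
            self.stalled = True  # wait for the timing simulator's reply
            return out
        return None

    def receive_timing(self, cycles):
        """Accept timing information and resume simulation."""
        self.stalled = False
        return cycles


class TimingSimulator:
    def __init__(self, sim_ids):
        self.global_ts = 0
        # One simulated thread per functional simulator in this sketch.
        self.local_ts = {sim_id: 0 for sim_id in sim_ids}
        self.pending = {}  # timing information held until the sync point

    def receive(self, sim_id, batch):
        """Cost a batch; return all held replies at a sync point, else {}."""
        cycles = sum(CYCLE_COST.get(opcode, 1) for opcode, _ in batch)
        self.local_ts[sim_id] += cycles
        self.pending[sim_id] = cycles
        # Assumed rule: synchronize once every local timestamp has passed
        # the global timestamp, then advance the global timestamp.
        if all(ts > self.global_ts for ts in self.local_ts.values()):
            self.global_ts = min(self.local_ts.values())
            replies, self.pending = self.pending, {}
            return replies
        return {}
```

In this sketch, a caller would pass each batch returned by simulate to TimingSimulator.receive and, once a non-empty reply dictionary is returned, deliver each cycle count to the corresponding functional simulator through receive_timing so that it resumes. In a distributed deployment, those calls would instead be messages exchanged over the network.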

Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the foregoing specification, a detailed description has been given with reference to specific example embodiments. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense. Furthermore, the foregoing use of embodiment and other exemplary language does not necessarily refer to the same embodiment or the same example, but may refer to different and distinct embodiments, as well as potentially the same embodiment.

What is claimed is:
 1. A distributed simulation system comprising: a first computing node to execute a timing simulator of a graph processing system; and one or more second computing nodes coupled to the first computing node over a network, the second computing nodes to execute functional simulators of a graph processing system; wherein: the functional simulators are to simulate execution of instructions on the graph processing system and send information associated with the simulated instructions to the timing simulator over the network; and the timing simulator is to determine timing information associated with execution of the instructions sent by the functional simulators, and send the timing information to the functional simulators over the network.
 2. The system of claim 1, wherein at least one functional simulator is to execute a simulation of a component of the graph processing system, and to send instruction information to the timing simulator after a predetermined number of instructions have been simulated by the functional simulator for the component.
 3. The system of claim 2, wherein the component is a single threaded core of the graph processing system.
 4. The system of claim 2, wherein the component is a multi-threaded core of the graph processing system, and the functional simulator is to send instruction information to the timing simulator after a predetermined number of instructions have been simulated by the functional simulator for all of the threads of the multi-threaded core in aggregate.
 5. The system of claim 4, wherein the timing simulator is to determine a thread interleaving factor for the multi-threaded core and send the interleaving factor to the functional simulator, the thread interleaving factor indicating a quantity of instructions to simulate for each thread of the multi-threaded core per cycle.
 6. The system of claim 2, wherein the component is an accelerator of the graph processing system, and the functional simulator is to send instruction information to the timing simulator after a predetermined quantity of sub-instructions of an accelerator instruction have been executed.
 7. The system of claim 2, wherein the functional simulator is to send the instruction information to the timing simulator, without waiting for the predetermined quantity of instructions to be simulated, based on one or more of a wait-for-DMA instruction, an atomic operation, an enqueue/dequeue operation, and a memory scatter instruction being simulated on the functional simulator.
 8. The system of claim 1, wherein the timing information for respective sets of instructions includes a quantity of cycles required to execute the set of instructions on the graph processing system.
 9. The system of claim 1, wherein: the timing simulator is to determine a global synchronization point for a plurality of functional simulators, and send the timing information for the sets of instructions to the respective functional simulators at the global synchronization point; and each functional simulator is to stall simulation of further instructions until the timing information for its set of instructions is received from the timing simulator.
 10. The system of claim 9, wherein the timing simulator is to maintain a local timestamp for each thread being simulated on a functional simulator and a global timestamp, and to determine the global synchronization point based on comparisons between the local timestamps and the global timestamp.
 11. The system of claim 1, wherein the instruction information includes instruction opcodes and data addresses associated with the opcodes.
 12. The system of claim 11, wherein the instruction information further includes data values.
 13. The system of claim 1, wherein the instruction information includes the instructions.
 14. The system of claim 1, wherein the second computing nodes are to communicate with one another via a Message Passing Interface (MPI).
 15. The system of claim 1, wherein the first computing node does not execute a functional simulator and the second computing nodes do not execute timing simulators.
 16. The system of claim 1, wherein the network is an Ethernet-based network or Infiniband-based network.
 17. A method comprising: at a plurality of functional simulators, simulating execution of instructions on components of a graph processing system and sending information associated with the simulated instructions to a timing simulator; and at a timing simulator, determining timing information associated with execution of the instructions sent by the functional simulators and sending the timing information to the functional simulators.
 18. The method of claim 17, further comprising: determining, by the timing simulator, a global synchronization point for the functional simulators, and sending the timing information for the sets of instructions to respective functional simulators at the global synchronization point; and at the functional simulators, stalling simulation of further instructions until the timing information is received from the timing simulator.
 19. The method of claim 17, further comprising maintaining, by the timing simulator, a local timestamp for each thread being simulated on a functional simulator and a global timestamp, wherein the global synchronization point is determined based on comparisons between the local timestamps and the global timestamp.
 20. One or more non-transitory computer-readable media comprising instructions that, when executed by processors of a distributed simulation system, cause the processors to: receive, from a plurality of functional simulators, information associated with the simulated execution of instructions on components of a graph processing system; determine timing information associated with execution of the instructions sent by the functional simulators; determine a global synchronization point for the functional simulators; and send the timing information for the sets of instructions to respective functional simulators at the global synchronization point.
 21. The computer-readable media of claim 20, wherein the instructions are further to determine a thread interleaving factor for a multi-threaded core being simulated by a functional simulator and send the interleaving factor to the functional simulator, the thread interleaving factor indicating a quantity of instructions to simulate for each thread of the multi-threaded core per cycle.
 22. The computer-readable media of claim 20, wherein the instructions are to maintain a local timestamp for each thread being simulated on a functional simulator and a global timestamp, and wherein the global synchronization point is determined based on comparisons between the local timestamps and the global timestamp.
 23. One or more non-transitory computer-readable media comprising instructions that, when executed by processors of a distributed simulation system, cause the processors to: simulate execution of instructions on a component of a graph processing system; after a predetermined quantity of instructions have been simulated for the component, cause information associated with the simulated instructions to be sent to a timing simulator over a network; and stall simulation of further instructions until timing information is received from the timing simulator for the quantity of instructions.
 24. The computer-readable media of claim 23, wherein simulating execution of instructions comprises simulating a multi-threaded core of the graph processing system, and the instructions are further to send the instruction information to the timing simulator after a predetermined quantity of instructions have been simulated for all of the threads of the multi-threaded core in aggregate.
 25. The computer-readable media of claim 23, wherein the instruction information includes instruction opcodes and data addresses associated with the opcodes. 